Compare commits

...

4895 Commits

Author SHA1 Message Date
a5e8b0ad38 Trying to reduce flash-deps
ghstack-source-id: 8ba7b23dfde594e126977930e54395405573a598
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144120
2025-01-09 16:40:01 -08:00
dcc3cf7066 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
The fixes are generated by:

```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
a742859fc2 [ONNX] Update images and APIs to onnx_dynamo.rst (#144358)
Update the result image of exporting, and delete the functions/class that belongs to `torch.onnx.dynamo_export`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144358
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-01-08 21:44:43 +00:00
a5164a2b18 [BE] Clean up ExecuTorch Export Docstring (#141490)
Summary: I noticed when looking at the docs for [`torch.export.load`](https://pytorch.org/docs/stable/_modules/torch/export.html#load) that it looked like there was a copy and paste error from the save command docstring since ep is not an actual parameter for load and it says "The exported program to save." This diff removes it from the docstring.

Test Plan: Automated Testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141490
Approved by: https://github.com/JacobSzwejbka
2025-01-08 21:28:58 +00:00
8c5d992772 [Pipelining] Refactor pp composability test to use faster MPCT (#144345)
* Using MultiProcessContinuousTest base class is faster (60s vs 279s for
  the full run of `test_manual_with_data_parallel` and all its
  parametrizations)
* Have to move to a new file to use MPCT since it requires a different
  launcher style in `__main__`
* Propose to reorganize the composability tests anyway, since
  `test/_composable/test_composability/test_pp_composability` is an
  annoyingly long path
* rename `test_manual_with_data_parallel` to `test_pp_dp` for
  simplicity/consistency with newer test names.  (manual refers to not
  using tracer frontend, but that's not so important to call out in the
  test name)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345
Approved by: https://github.com/H-Huang, https://github.com/mori360
2025-01-08 20:50:12 +00:00
c194e5c986 Remove extra copy torch/_prims (#144407)
updated _reshape_aten

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144407
Approved by: https://github.com/awgu
2025-01-08 20:14:48 +00:00
628acc4ace Dirichlet.mode: use dim= instead of axis= (#144402)
`axis=` is undocumented and will raise typing errors when #144197 is merged.

See: https://github.com/pytorch/pytorch/pull/144197#pullrequestreview-2537398866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144402
Approved by: https://github.com/Skylion007
2025-01-08 20:14:01 +00:00
ab1f627aa4 fix randint distribution for large max (#143787)
Fixes #ISSUE_NUMBER
Similar to #143682, for large maximum values we were sampling integers via `%`, which does not produce a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`).
This comes with significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.
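For intuition, here is a small standalone sketch of the modulo bias (illustrative only, not the actual RNG code path):

```python
import torch

# Reducing a uniform 32-bit draw with `% n` over-represents residues below
# 2**32 % n whenever 2**32 is not a multiple of n; for n close to 2**32 the
# skew is large.
n = 3_000_000_000
u32 = torch.randint(0, 2**32, (10_000_000,), dtype=torch.int64)  # uniform 32-bit values
biased = u32 % n
frac_low = (biased < 2**32 - n).float().mean()
print(frac_low)  # ~0.60, whereas a truly uniform draw over [0, n) would give ~0.43
```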

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
2025-01-08 18:51:48 +00:00
0e1675a89b Relax aten.to restriction (#142420)
Summary: if we have a.to(b), and b has a different dtype than a, then it must be a copy. In this case, we do not need to freeze the tensor. Instead, we use torch.ops.aten._assert_tensor_metadata.default to ensure that a does not have the same dtype as b.
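A minimal sketch of the kind of pattern this affects (module, shapes, and dtypes here are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # dtype differs from the input's, so .to() must produce a copy and the
        # input no longer needs to be frozen as a constant during export
        return x.to(torch.float16) + 1

ep = torch.export.export(M(), (torch.randn(4, dtype=torch.float32),))
print(ep)
```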

Fixes https://github.com/pytorch/pytorch/issues/139718

Update executorch pin to include https://github.com/pytorch/executorch/pull/7277.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_float_conversion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_device_to_mutation_float
```

Differential Revision: D66988295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142420
Approved by: https://github.com/bdhirsh
2025-01-08 18:11:31 +00:00
768d73f692 use torch.special.xlogy to implement x_log_x (#144220)
Fixes #144279

Using `x * x.log()` does not produce the correct value when `x=0`.
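For example:

```python
import torch

x = torch.tensor([0.0, 0.5, 2.0])
print(x * x.log())                # tensor([nan, -0.3466, 1.3863]) -- 0 * log(0) gives nan
print(torch.special.xlogy(x, x))  # tensor([0.0000, -0.3466, 1.3863]) -- defined as 0 at x=0
```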

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144220
Approved by: https://github.com/Skylion007
2025-01-08 17:41:55 +00:00
cyy
d0070ca07e [18/N] Fix extra warnings brought by clang-tidy-17 (#144014)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144014
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-08 17:21:55 +00:00
373541fbf4 [BE]: Remove unnecessary copy of gradients in util (#144329)
No need to copy gradients to CPU too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144329
Approved by: https://github.com/awgu, https://github.com/cyyever
2025-01-08 16:52:15 +00:00
e14c36d3f4 Set maximum supported version of Python as 3.13 (#144396)
Same as https://github.com/pytorch/pytorch/pull/119743 Required for Release 2.6.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144396
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
2025-01-08 16:16:10 +00:00
3068ce0337 ROCm SDPA: Ensure attn_mask has the same dtype as q (#143242)
This is required by current AOTriton's backend.

Fixes NaN when calling SDPA ME backend with `q.dtype() != attn_mask.dtype()` when training llama2 using transformers+deepspeed+pytorch

Corresponding CUDA check seems to be here:
708ce3c008/aten/src/ATen/native/transformers/cuda/attention.cu (L1331-L1336)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143242
Approved by: https://github.com/jeffdaily
2025-01-08 15:20:26 +00:00
708ce3c008 Add is_dtype_supported predicate to DeviceInterface (#144355)
By default it returns true unless the dtype is bf16.

For the MPS device it returns false if the dtype is double.

Verified by refactoring `test_inf`, which now expects a TypeError to be raised when invoked with an unsupported dtype.
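A standalone illustration of the described behavior (hypothetical helper, not the actual `DeviceInterface` method):

```python
import torch

def is_dtype_supported(device_type: str, dtype: torch.dtype) -> bool:
    # Hypothetical restatement of the predicate described above:
    # the default answer is True except for bf16, and MPS also rejects double.
    if device_type == "mps" and dtype == torch.float64:
        return False
    return dtype != torch.bfloat16

print(is_dtype_supported("cpu", torch.float32))  # True
print(is_dtype_supported("mps", torch.float64))  # False
```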

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144355
Approved by: https://github.com/jansel, https://github.com/dcci
2025-01-08 13:59:46 +00:00
8fc0ffe54b [mps/inductor] Add support for rsqrt(). (#144374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144374
Approved by: https://github.com/malfet
2025-01-08 13:58:05 +00:00
f700035090 [3.13t] use sysconfig to check for Python nogil builds (#144361)
`sys._is_gil_enabled()` wasn't working in certain cases, according to @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144361
Approved by: https://github.com/atalman
2025-01-08 13:00:32 +00:00
a5051a9521 Update torch.masked.mean to upcast dtype for bool tensors (#139999)
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.

The below example shows how the incorrect result occurs:
```
a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```

This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.
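A small usage sketch of the case described above (expected values per this description):

```python
import torch

a = torch.tensor([True, True])
mask = torch.tensor([True, True])

# Before this change the bool sum saturated at 1, giving 0.5; with the upcast
# the masked mean of two True values should be 1.0.
print(torch.masked.mean(a, mask=mask))
```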

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
2025-01-08 10:35:19 +00:00
60a505022f [AMD] SDPA internal changes (#144320)
Summary: All the internal changes needed to enable flash attention w/ SDPA in fbcode.

Test Plan:
```
TORCH_ROCM_FA_PREFER_CK=1  buck run -m rocm621  mode/opt-amd-gpu scripts/xdwang/example:sdpa

+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Math Time (µs) |   xformers Time (µs) |   Flash TFlops |   Math TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+==================+======================+================+===============+===================+========================+===========================+======================+===================+
|            1 |              4096 |      32 |         64 |           455.552 |          7748.76 |              513.449 |        301.698 |       17.7369 |           267.678 |                17.0096 |                   15.0916 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      16 |        128 |           329.971 |          4741.11 |              386.049 |        416.519 |       28.9888 |           356.014 |                14.3683 |                   12.2811 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |         64 |          1455.76  |         31869.6  |             1665.49  |        377.642 |       17.2501 |           330.087 |                21.8921 |                   19.1353 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |        128 |          1265.77  |         18972.8  |             1479.48  |        434.325 |       28.976  |           371.588 |                14.9891 |                   12.824  |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |         64 |          5732.99  |        121861    |             6816.77  |        383.573 |       18.0453 |           322.59  |                21.2562 |                   17.8767 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |        128 |          4749.69  |         73776.4  |             5404.03  |        462.982 |       29.8066 |           406.923 |                15.5329 |                   13.6521 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Math Time (µs) |   xformers Time (µs) |   Flash TFlops |   Math TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+==================+======================+================+===============+===================+========================+===========================+======================+===================+
|            1 |              4096 |      32 |         64 |           1615.41 |          8342.67 |              1822.72 |        212.7   |       41.1855 |           188.508 |                5.16443 |                   4.57705 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      16 |        128 |           1357.97 |          5943.53 |              1432.34 |        253.022 |       57.8104 |           239.886 |                4.37676 |                   4.14953 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |         64 |           5556.5  |         31726.7  |              6502.17 |        247.348 |       43.3197 |           211.374 |                5.70984 |                   4.8794  |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |        128 |           5186    |         22529.4  |              5590.36 |        265.019 |       61.0044 |           245.85  |                4.34427 |                   4.03004 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |         64 |          22527.7  |        130413    |             26527.6  |        244.035 |       42.155  |           207.239 |                5.789   |                   4.91613 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |        128 |          18347.9  |         87553.2  |             20358    |        299.628 |       62.791  |           270.044 |                4.77184 |                   4.30068 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+

```

Reviewed By: leitian, feikou, yoyoyocmu, sijiac

Differential Revision: D67262726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144320
Approved by: https://github.com/jianyuh, https://github.com/eqy, https://github.com/leitian
2025-01-08 09:29:28 +00:00
7d9f26de05 Revert "Unskipped multiple inductor tests for ROCm (#143581)"
This reverts commit e05d67790ee4a53c310322829631c000f0ac2985.

Reverted https://github.com/pytorch/pytorch/pull/143581 on behalf of https://github.com/huydhn due to There is some tests failing on ROCm jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143581#issuecomment-2577163274))
2025-01-08 09:15:14 +00:00
aaf56152ea [cpu/sorting] Throw an error when trying to sort complex numbers. (#144113)
It doesn't really make sense to sort complex numbers as they are not comparable.

Fixes #129296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144113
Approved by: https://github.com/malfet
2025-01-08 05:15:36 +00:00
78eded8e00 [ONNX] Use torch.export.Dim.AUTO in dynamo_export (#144356)
Align to the changes in https://github.com/pytorch/pytorch/pull/143158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144356
Approved by: https://github.com/justinchuby
2025-01-08 05:00:16 +00:00
90e81a157a Migrate from Tuple -> tuple in torch/utils/data (#144255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144255
Approved by: https://github.com/andrewkho
2025-01-08 04:09:45 +00:00
8ccf3f6f3f [dynamo][easy] Move dict tests to test_dicts.py (#144165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144165
Approved by: https://github.com/jansel
ghstack dependencies: #143997
2025-01-08 03:56:33 +00:00
2ac41404a8 [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
2025-01-08 03:56:33 +00:00
e05d67790e Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-08 03:55:33 +00:00
28b4992e7a Set prop_kind to forward_inference when grad is not needed for mkldnn_convolution_pointwise (#142855)
`prop_kind` of MKLDNN convolution is always `dnnl_forward`, i.e., `dnnl_forward_training` , regardless of whether grad is needed. Setting `prop_kind` to `dnnl_forward_inference` for mkldnn_convolution_pointwise could have better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142855
Approved by: https://github.com/jgong5
2025-01-08 02:22:06 +00:00
f8fcb9e7d3 [Quant][Inductor][X86] Separate unary post op fusion and lowering for qlinear (#143903)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

This PR is the first of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves unary post op fusion of qlinear out of the lowering pass.
This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143903
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2025-01-08 01:55:53 +00:00
094ca3154d Fix torch._refs.tensor error with empty list (#143461)
Fixes #143216

**Test Result**

**Before**

```python
>>> import torch
>>> torch._refs.tensor([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6614, in tensor
    new_tensor = _internal_new_from_data(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6596, in _internal_new_from_data
    tensor = _recursive_build(inferred_scalar_type, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6545, in _recursive_build
    return torch.stack([_recursive_build(scalarType, item) for item in seq])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects a non-empty TensorList

```

**After**

```python
>>> torch._refs.tensor([])
tensor([])
>>> torch._refs.tensor([], device='cuda')
tensor([], device='cuda:0')
```

```bash
$ pytest test/test_tensor_creation_ops.py -k test_refs_tensor
```

![image](https://github.com/user-attachments/assets/5be4c17a-bea6-4b7b-bec1-b4fcb417a8cd)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/e8f88f41-78ac-4337-b53f-2e524de2bec0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143461
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-01-08 01:29:00 +00:00
9e6a6389ce [functorch] clean up asserts in test_dims.py (#144276)
For better debuggability of issues encountered in e.g., #141730 when trying to migrate to python 3.12/3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144276
Approved by: https://github.com/Skylion007
2025-01-08 01:21:40 +00:00
013c796b1e Eliminate c10::optional usage in PyTorch (#144346)
Differential Revision: D67907427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144346
Approved by: https://github.com/hl475
2025-01-08 01:14:04 +00:00
f002825e1e added __add__ and __mul__ hints to torch.Size (#144322)
Fixes #144218

`Size` returns `Size`, whereas `tuple` returns `tuple`: 9f28171658/stdlib/builtins.pyi (L985-L988)

- Use `SupportIndex` instead of `int` in `__getitem__` (supported at runtime)
- `Size.__add__` overrides `tuple.__add__`; the latter supports adding tuples of non-integral types.
- Added typing unit tests.
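A runtime illustration of the behavior the new hints describe (per the note above that `Size` returns `Size`):

```python
import torch

s = torch.Size([2, 3])
t = s + torch.Size([4])
print(type(t), t)  # <class 'torch.Size'> torch.Size([2, 3, 4]) -- the hints now reflect this
```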

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144322
Approved by: https://github.com/Skylion007
2025-01-08 01:02:11 +00:00
06ea81336f [Inductor UT] Remove excepted failure for aoti test_fft_c2c (#144238)
Since #143223 enabled runtime dispatch for fft_c2c in AOTI mode, for XPU we can now fall back fft_c2c (which has no XPU implementation) to CPU, and the case passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144238
Approved by: https://github.com/jansel
2025-01-08 00:49:32 +00:00
96f4abba17 [dtensor] move all tests to distribute/tensor folder (#144166)
as titled, mainly moving files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166
Approved by: https://github.com/Skylion007
2025-01-08 00:32:33 +00:00
7c9cf287c2 [ONNX] Handle list values as 0d inputs (#144343)
Handle list values as 0d inputs instead of 1d, as the `SymInt`s are expected to be 0d tensors in ONNX.

This PR reshapes int64 values into 1D tensors in a list, assuming they are 0D tensors initially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144343
Approved by: https://github.com/gramalingam, https://github.com/titaiwangms
2025-01-08 00:15:50 +00:00
9ee242213b [RFC] Introduce cache hot loading APIs (a.k.a. "Mega-cache") (#143341)
This PR essentially introduces two new APIs
* torch.compiler.save_cache_artifacts
* torch.compiler.load_cache_artifacts

which aim to create a mega cache experience where the user can start collecting cache artifacts, and later call the save API to fetch them. In the next attempt, the user can "hot load" the cache artifacts via the load function.

This bundling approach reduces the need to port individual files one by one or to rely on many network requests.

Note that these APIs CANNOT log to structured logging as these functions will be called before and after compilation, as opposed to during compilation. Due to this limitation, the API returns a struct that the user can log with.
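A sketch of the intended workflow (the exact return value of the save API is an assumption here):

```python
import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))  # run once so the compilation caches get populated

# Collect the accumulated cache artifacts; the PR describes the result as a
# struct the user can log or store.
artifacts = torch.compiler.save_cache_artifacts()

# Later, in a fresh process, hot-load the bundle before compiling again
# (assuming load accepts what save produced):
# torch.compiler.load_cache_artifacts(artifacts)
```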

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143341
Approved by: https://github.com/jansel
2025-01-07 23:13:24 +00:00
c2c50d5f00 Fixed doc where more than one device specified since only one device is used (#17553) (#144043)
Fixes #17553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144043
Approved by: https://github.com/soulitzer
2025-01-07 23:06:52 +00:00
430d54ee20 [Dynamo] Add functorch C++ bindings as in graph functions (#144309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144309
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307, #144308
2025-01-07 22:25:01 +00:00
d146763f6f [Dynamo] Inline functions in torch._ops (#144308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144308
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307
2025-01-07 22:25:01 +00:00
242a4a3f83 [Dynamo] Inline functions in torch._functorch.pyfunctorch (#144307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144307
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306
2025-01-07 22:24:53 +00:00
4417be65e5 [Dynamo] Inline functions in torch._functorch.autograd_function (#144306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144306
Approved by: https://github.com/williamwen42
2025-01-07 22:24:46 +00:00
3beb7006dd c10::optional -> std::optional in a few places (#144340)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144340
Approved by: https://github.com/malfet
2025-01-07 21:09:39 +00:00
f4969c8235 fix torch.compile + ddp + non-reentrant AC pack hook firing count (#144271)
FIXES https://github.com/pytorch/pytorch/issues/144035

In order to preserve hook firing semantics, we disabled pack/unpack hooks for torch.compile: https://github.com/pytorch/pytorch/pull/123196. In DDP under torch.compile, there's this other callsite that we need to disable hooks for

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144271
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2025-01-07 21:08:52 +00:00
861b65fe74 [Easy] Fix linalg.norm hint message typo (#144323)
Fixes #136454

**Test Result**

**Before**

```python
>>> import torch
>>> from torch import linalg
>>>
>>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]])
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2]
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2]

```

**After**

```python
>>> import torch
>>> from torch import linalg
>>>
>>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]])
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2]
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144323
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-01-07 20:34:16 +00:00
d38af6e8bc [ca] dedup node names when AOT bwd graph is reused multiple times (#144202)
This error started popping up in HUD CA benchmarks:
```python
 File "/data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py", line 371, in dce
    self.fx_tracer.graph.eliminate_dead_code(is_impure)
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1862, in eliminate_dead_code
    self.lint()
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1753, in lint
    raise RuntimeError(f"Node redefined name {node.name}!")
RuntimeError: Node redefined name aot0_expand!
```

We added CA initial capture's renaming (https://github.com/pytorch/pytorch/pull/133148) to help debug issues with AOT backward, but it errors out when we have multiple instances of the same AOT backward. This likely only showed up now because of increased hierarchical graph reuse. I fix it by adding a postfix counter to the node name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144202
Approved by: https://github.com/bdhirsh, https://github.com/jansel
2025-01-07 20:23:09 +00:00
72e8f34715 [AoTI Minifier] UX Improvement (#143330)
Summary:
- When a user specifies the `TORCHINDUCTOR_MAX_AUTOTUNE=1` env variable, we add `config.max_autotune=True` to the generated minifier_launcher
- We should do this for other inductor configs as well in a followup Diff

Currently in the dynamo and aoti minifiers, if a config is overwritten by an env variable, the config will not show up in the config list in the minifier_launcher.py file. As a result, when running the minifier_launcher, users need to re-apply the same env variable.
This is:
1) not convenient for the users
2) if users copy-paste the minifier_launcher.py to us without including the env variable, we could be confused and unable to reproduce the error.

Underlying implementation change:

- Add `env_default` parameter to `codegen_config()`. If set, configs overridden by the env are not considered default.

Test Plan:
```
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config
```

Differential Revision: D67299312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143330
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-07 20:04:19 +00:00
096cb874d3 remove allow-untyped-defs from torch/_prims/executor.py (#144233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144233
Approved by: https://github.com/Skylion007
2025-01-07 19:40:40 +00:00
0aa74d0ab9 Skip L1 cache for single-use buffers (#143115)
### 1. Synopsis

Adds `cache_modifier='.cg'` optional argument into `tl.load` instructions in the inductor-generated triton code for selected buffers.

It makes the `tl.load` instruction skip the L1 cache for short-lived / non-reused data.

### 2. Using the feature

This feature is experimental and disabled by default.  It can be enabled by setting the environmental variable `TORCHINDUCTOR_SKIP_L1` equal to `1`.

### 3. Results

For a simple pointwise addition kernel:
```python
@torch.compile
def add_dummy(x: torch.Tensor, y: torch.Tensor):
    return x+y
```
we get (bandwidth performance is in GB/s):

(a) feature DISABLED:
![image](https://github.com/user-attachments/assets/6caaf775-f083-4943-a61f-8a1bcb154387)

(b) feature ENABLED:
![image](https://github.com/user-attachments/assets/9286be7d-c6ff-4a33-a023-77cb5cc87ff6)

### 4. Caveats

The feature boost is only available when using
```python
torch._dynamo.config.cache_size_limit = 64 # or any other sufficiently big number..
torch._dynamo.config.automatic_dynamic_shapes = False   # use static shapes
```
When using (the default) dynamic shapes, only 1-2 triton kernels are generated with non-optimal block-sizes for
*all* the cases (vector sizes), hiding any perf benefit from skipping the L1 cache.

In the static case, as an optimal block size is generated for each vector size, the perf benefit of skipping the L1 cache becomes visible.

This block-size optimization issue is a larger problem in pytorch inductor and is outside the scope of this feature.
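Putting sections 2 and 4 together, a minimal usage sketch (assumes a CUDA device; the env var must be set before any kernels are generated):

```python
import os
os.environ["TORCHINDUCTOR_SKIP_L1"] = "1"  # enable the experimental feature

import torch

torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.automatic_dynamic_shapes = False  # static shapes, per the caveats

@torch.compile
def add_dummy(x: torch.Tensor, y: torch.Tensor):
    return x + y

if torch.cuda.is_available():
    out = add_dummy(torch.randn(2**20, device="cuda"), torch.randn(2**20, device="cuda"))
```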

### 5. References

- [tl.load](https://triton-lang.org/main/python-api/generated/triton.language.load.html#triton.language.load)
- [cache operators](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143115
Approved by: https://github.com/jansel
2025-01-07 19:35:40 +00:00
355b0bc7e3 [typing] Add type hints to @property and @lazy_property in torch.distributions. (#144110)
Fixes #76772, #144196
Extends #144106

- added type annotations to `lazy_property`.
- added type annotations to all `@property` and `@lazy_property` inside the `torch.distributions` module.
- added a simple type-check unit test to ensure type inference is working.
- replaced deprecated annotations like `typing.List` with the corresponding counterpart.
- simplified `torch.Tensor` hints with plain `Tensor`, otherwise signatures can become very verbose.
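A small example of what the annotations enable for downstream code (Normal is just one of the affected distributions):

```python
import torch
from torch import Tensor
from torch.distributions import Normal

d = Normal(loc=torch.zeros(3), scale=torch.ones(3))
m: Tensor = d.mean    # with the annotated @property, type checkers infer Tensor here
s: Tensor = d.stddev
print(m.shape, s.shape)
```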

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144110
Approved by: https://github.com/Skylion007
2025-01-07 19:27:36 +00:00
aa69d73e6b [ROCm] fix torch.layer_norm invalid configuration problem when input is large tensor (#144007)
Fixes #136291

This PR fixes the `invalid configuration argument` problem that happens on ROCm when calling `torch.layer_norm` with a large input tensor.

```
 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm
    return torch.layer_norm
RuntimeError: HIP error: invalid configuration argument
```

After investigation, I found that the reason why this error happened is: The amd compute language runtime checks whether  `gridDim.x * blockDim.x` is greater than `std::numeric_limits<uint32_t>::max()` or not. If yes, it will error out with the "invalid configuration argument" message.

The fix is to split the whole task to several chunks so that each chunk will not trigger the failure condition. This will ensure the correctness and completeness given the current kernel implementation logic of `vectorized_layer_norm_kernel`.

Also added a largeTensor layer_norm unit test `test_layer_norm_large_tensor` with the same shape `[16, 3000, 3000, 16]` as the one used by the pytorch issue #136291 so that the unit test can check the expected output value to ensure correctness.
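A hedged repro sketch of the failing scenario (needs a GPU with plenty of memory; the dtype here is an assumption):

```python
import torch
import torch.nn.functional as F

# Before the fix, an input this large made gridDim.x * blockDim.x exceed the
# uint32 limit on ROCm and raised "invalid configuration argument".
if torch.cuda.is_available():
    x = torch.randn(16, 3000, 3000, 16, dtype=torch.float16, device="cuda")
    out = F.layer_norm(x, normalized_shape=(16,))
    print(out.shape)
```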

The future work may include performance optimization of layer_norm and CK layer_norm integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144007
Approved by: https://github.com/eqy
2025-01-07 19:17:02 +00:00
6c54963f75 Revert "[dtensor] move all tests to distribute/tensor folder (#144166)"
This reverts commit 2e1ea8598f477322965c28fb52e6e5f53876d8dd.

Reverted https://github.com/pytorch/pytorch/pull/144166 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but inductor/test_compiled_autograd needs to be updated ([comment](https://github.com/pytorch/pytorch/pull/144166#issuecomment-2575969871))
2025-01-07 18:31:36 +00:00
e4a05dec0f [BE][Ez]: Fix docs recommending inefficient tensor op order (#144270)
`detach().clone()` is faster than `.clone().detach()` since the gradients are not cloned. Let's update all the documentation and tests so that users do not use the inefficient op ordering.
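A quick illustration of the recommended ordering:

```python
import torch

param = torch.randn(3, requires_grad=True)

a = param.detach().clone()  # detaches first, so the clone records no autograd history
b = param.clone().detach()  # clones while still attached, then detaches the copy

print(a.requires_grad, b.requires_grad)  # False False -- same result, the first form does less work
```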
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144270
Approved by: https://github.com/awgu, https://github.com/XuehaiPan
2025-01-07 17:31:32 +00:00
8d35333498 [CD] Aarch64 builds should not override OVERRIDE_PACKAGE_VERSION envvar (#144285)
Currently our nightly aarch64 binaries have the correct suffixes (+cpu or +cu126), but release binaries are missing these suffixes. To correct this and make sure nightly and release binaries are consistent, I propose this change.

I see that override is already set correctly in release workflow:
https://github.com/pytorch/pytorch/actions/runs/12383179841/job/34565381200

For CPU:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cpu"
```

For CUDA:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cu126"
```

The removed code would set OVERRIDE_PACKAGE_VERSION="2.6.0" for both CUDA and CPU release binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144285
Approved by: https://github.com/malfet, https://github.com/tinglvv
2025-01-07 12:50:54 +00:00
12fdb93ebd fix non-strict placeholder naming with kwargs (#144278)
Fixes https://github.com/pytorch/pytorch/issues/143732

Differential Revision: [D67872055](https://our.internmc.facebook.com/intern/diff/D67872055/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144278
Approved by: https://github.com/yushangdi, https://github.com/pianpwk
2025-01-07 11:22:09 +00:00
c3b28491c8 [caffe2] Add AVX512 support for box_cox operator (#143627)
Summary:
Reuse the templatized implementation of the box_cox caffe2 operator.
* Duplicate .cc file of AVX2
* change intrinsics functions to use AVX512 instructions
* override templates
* extend the caller to use new methods
* guard AVX512 with a gflag to allow smooth transition

Differential Revision: D67433457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143627
Approved by: https://github.com/hl475
2025-01-07 09:54:39 +00:00
bf7747e935 Tests Generalization for multiple accelerator devices (#139184)
Motivation: Generalize unit tests so that they can be executed for CUDA and non-CUDA devices.
Dependency: #133209, merged now.
There was a previous PR #135242 for these changes, closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501 @zeshengzong Please review the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-01-07 09:04:38 +00:00
2e1ea8598f [dtensor] move all tests to distribute/tensor folder (#144166)
as titled, mainly moving files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166
Approved by: https://github.com/Skylion007
2025-01-07 06:45:14 +00:00
d0f5df83a5 [ca] add test_dtensor_compile.py to compiled autograd tests (#144107)
more than half the tests use autograd, pass rate 19/26

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel
2025-01-07 05:16:14 +00:00
fcf9dc3b11 Migrate from Tuple -> tuple in benchmarks (#144259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259
Approved by: https://github.com/yanboliang
2025-01-07 04:09:52 +00:00
2e42be0595 Use random64 in Fisher-Yates algorithm for large N (#143682)
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/malfet
2025-01-07 03:48:56 +00:00
551f104153 [mps/inductor] Add support for sign(). (#144298)
Drive-by fix of a test name while I was at it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144298
Approved by: https://github.com/malfet
2025-01-07 03:33:26 +00:00
a3ab27b8e0 Migrate from Tuple -> tuple in torch/_inductor (#144264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264
Approved by: https://github.com/eellison
2025-01-07 03:27:27 +00:00
778d953951 Revert "[AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)"
This reverts commit 24ac87392bc4e0060a90483643f7df5611988ae5.

Reverted https://github.com/pytorch/pytorch/pull/144011 on behalf of https://github.com/malfet due to Not sure what is going on, but lots of builds are failing ([comment](https://github.com/pytorch/pytorch/pull/144011#issuecomment-2574317669))
2025-01-07 03:24:01 +00:00
f4e9aebbcc Revert "Update torch.masked.mean to upcast dtype for bool tensors (#139999)"
This reverts commit 0742b2366e7ba65e0437a17b09a3bb0804ae51ea.

Reverted https://github.com/pytorch/pytorch/pull/139999 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a landrace and fails a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/139999#issuecomment-2574283986))
2025-01-07 02:42:55 +00:00
168c2cb3f3 remove allow-untyped-defs from torch/nn/utils/_deprecation_utils.py (#144231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144231
Approved by: https://github.com/albanD
2025-01-07 02:22:22 +00:00
24ac87392b [AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-07 02:15:42 +00:00
73a6a40346 [Inductor][CPP] Fix outer loop fusion buffer removed (#144243)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we saw some nodes with the following `LoopNest`s:

-  `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)`

- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)`

Although these 2 `LoopNest`s have the same `range` and `var`, they have different `steps` (1 and 16), so they fail to be merged with the outer loops. Since we removed the global buffers when localizing the buffer, we need to restore the status of `V.graph.removed_buffers` before falling back to codegen without outer loop fusion.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243
Approved by: https://github.com/jgong5
2025-01-07 01:17:46 +00:00
2f6f13562f [BE] Actually suppress vmap warning from gradcheck (#144287)
This is the much safer change compared to https://github.com/pytorch/pytorch/pull/144283

Before:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
  result = vmap(vjp)(torch.stack(grad_outputs))
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
  result = vmap(vjp)(torch.stack(grad_outputs))
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```

(the env vars aren't necessary)

After:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144287
Approved by: https://github.com/cyyever, https://github.com/soulitzer
2025-01-07 01:11:41 +00:00
61c0a3d1cb Fix lint in test_provenance_tracing.py (#144296)
Regression introduced by https://github.com/pytorch/pytorch/pull/143684/ that somehow did not surface on PR CI

IMO this also makes two branches of the test(compile vs aoti) more readable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144296
Approved by: https://github.com/xw285cornell, https://github.com/huydhn
2025-01-07 01:11:38 +00:00
48153c72b2 [Intel XPU] enable kineto for XPU Windows. (#144034)
This PR turns on `kineto` for the Windows XPU wheel build.

For `kineto` on Windows XPU, the build-time dependencies are:
1. Intel PTI, it contained by oneAPI 2025+.
2. Level zero SDK: https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip

**Note:**
We need to manually set up the Level Zero SDK at build time, so the kineto build on Windows XPU is turned off by default in order to avoid build issues for developers.
After adding the Level Zero SDK include path to the `INCLUDE` env var, the env var `XPU_ENABLE_KINETO` can be set to turn it on.

For runtime dependency:
1. The intel-pti pypi package. @chuanqi129 will follow up in a further PR.

Local tested the nightly binary:

<img width="1909" alt="image" src="https://github.com/user-attachments/assets/7dfaa7bc-e8ed-40b8-bc71-f91a3df3b95f" />

TODO: @chuanqi129 will submit a following PR to add `intel-pti` as dependency and turn on env_var `XPU_ENABLE_KINETO` for nightly build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144034
Approved by: https://github.com/chuanqi129, https://github.com/zejun-chen, https://github.com/EikanWang, https://github.com/sraikund16
2025-01-07 01:11:25 +00:00
0742b2366e Update torch.masked.mean to upcast dtype for bool tensors (#139999)
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.

The below example shows how the incorrect result occurs:
```
a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```

This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
2025-01-07 00:26:59 +00:00
f013cfee38 [TreeSpec] Support enum in defaultdict (#144235)
Summary: Followup from D66269157, add support for enum in defaultdict.

Test Plan: Added unit test

Differential Revision: D67832100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144235
Approved by: https://github.com/henrylhtsang, https://github.com/houseroad
2025-01-07 00:10:46 +00:00
c68c38c673 Support getattr for tensor subclasses in pre-dispatch export via patching tensor.getattr (#143946)
Previous discussion: https://github.com/pytorch/pytorch/pull/143671#issuecomment-2560112499 and https://github.com/pytorch/pytorch/pull/143671

Differential Revision: [D67693609](https://our.internmc.facebook.com/intern/diff/D67693609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143946
Approved by: https://github.com/bdhirsh
2025-01-06 23:55:50 +00:00
66059f80d2 Migrate from Tuple -> tuple in torch/profiler (#144257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144257
Approved by: https://github.com/sraikund16
2025-01-06 23:34:14 +00:00
5ccbfffd11 update expected results (#144274)
This PR (f6488d85a0) made it +1.3%, which is below the 1.5% threshold.
Once we have the API from dev infra and change the test, this won't be happening.

<img width="364" alt="Screenshot 2025-01-06 at 11 01 15 AM" src="https://github.com/user-attachments/assets/401b2d11-e400-49d6-b6f9-8e10ca141cb0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144274
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2025-01-06 23:18:21 +00:00
f879a6982d Enhance provenance tracing unit test to cover torch.compile() (#143684)
Summary: Follow up as title.

Test Plan:
```
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing
```

Differential Revision: D67543556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143684
Approved by: https://github.com/yushangdi
2025-01-06 22:58:04 +00:00
301b9c8a90 Fix PythonMod printing (#144078)
Fixes #144075
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144078
Approved by: https://github.com/anijain2305
2025-01-06 22:52:34 +00:00
edbda2fad8 remove allow-untyped-defs from torch/export/_remove_auto_functionalized_pass.py (#144230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144230
Approved by: https://github.com/Skylion007
2025-01-06 22:23:19 +00:00
d75ffccd0a Migrate from Tuple -> tuple in torch/_export (#144262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144262
Approved by: https://github.com/avikchaudhuri
2025-01-06 22:20:26 +00:00
00c18c8882 Make all-reduce input contiguous in distributed.nn.all_reduce (#144267)
Fixes https://github.com/pytorch/pytorch/issues/144060

I confirmed that the unit test fails without the `.contiguous()` fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144267
Approved by: https://github.com/wz337, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-06 22:20:04 +00:00
16c1b1048b [MPSInductor] Add nan constant generation (#144281)
If val is not equal to self, it's a nan (which is spelled as `NAN` in Metal)
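The same self-inequality trick, shown in Python for intuition:

```python
import torch

val = torch.tensor(float("nan"))
print(val != val)  # tensor(True): NaN is the only value not equal to itself
```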

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144281
Approved by: https://github.com/atalman, https://github.com/dcci
2025-01-06 22:13:23 +00:00
7d5249dbc2 [EZ][BE] Fix E226 flake8 violation (#144282)
Not sure why CI did not complain about it, but in my local runs it clearly says
```
Advice (FLAKE8) E226
    missing whitespace around arithmetic operator
    See https://www.flake8rules.com/rules/E226.html

        268  |            with code.indent():
        269  |                if len(idx_var_names) > 1:
        270  |                    for idx, name in enumerate(idx_var_names):
    >>> 271  |                        code.writeline(f"auto {name} = thread_pos.{chr(120+idx)};")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144282
Approved by: https://github.com/Skylion007
2025-01-06 22:12:21 +00:00
5d88002af6 [inductor] Avoid specializing over symbolic value during constant folding (#144176)
Fixes #143667. See more context in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144176
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-06 21:50:17 +00:00
729b7c0a84 [TGIF][Easy] Slightly improve the logging for tgif split pass (#143771)
Summary:
1. Added more details for some of the assert statements.
2. Moved assert statements to use tgif_assert

Test Plan: all unit tests should pass

Reviewed By: jingsh

Differential Revision: D67608251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143771
Approved by: https://github.com/jingsh
2025-01-06 21:00:15 +00:00
b5cf8e2460 [BE]: Remove redundant copy in torch chunk shard (#144269)
Fixes an issue noticed in a recent all_gather PR. Some parts of the codebase do a double copy with `clone().contiguous()`, which could be fused into a single copy op.
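An illustration of the general pattern (not necessarily the exact call site changed here):

```python
import torch

x = torch.randn(4, 4).t()    # non-contiguous view

y = x.clone().contiguous()   # clone keeps the transposed strides, so contiguous() copies again
z = x.contiguous()           # a single copy producing the same contiguous tensor

print(torch.equal(y, z), z.is_contiguous())  # True True
```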

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144269
Approved by: https://github.com/awgu
2025-01-06 20:52:49 +00:00
1b8a943011 remove allow-untyped-defs from ao/nn/sparse/quantized/utils.py (#144232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144232
Approved by: https://github.com/Skylion007
2025-01-06 19:54:27 +00:00
6d445bef0c [ROCm][NFC] Fix condition for small tensor tuning (#144087)
Fix condition for small tensor tuning to not impact non-ROCm compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144087
Approved by: https://github.com/jeffdaily
2025-01-06 19:40:22 +00:00
c62873a09a Fix incorrect python expression (#143675)
Summary:
This expression would return True always, causing the input to be deleted
on error, even for non-write modes:

```
>>> bool("w" or "+" or "a" in "rb")
True
```
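One way to express the intended check correctly (illustrative; the actual fix may differ):

```python
def is_write_mode(mode: str) -> bool:
    # Test each character against the mode string, instead of
    # `"w" or "+" or "a" in mode`, which always evaluates truthy.
    return any(c in mode for c in ("w", "+", "a"))

print(is_write_mode("rb"))  # False
print(is_write_mode("w"))   # True
```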
Test Plan: new test in test_fsspec.py

Differential Revision: D67537234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143675
Approved by: https://github.com/mayankgarg1990, https://github.com/huydhn
2025-01-06 19:04:26 +00:00
e3aac7f8a0 detect fake mode in proxy_tensor creation in make_fx (#144168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/143742

A FakeTensorMode may already exist when we are setting the "val" meta of a proxy tensor. We should detect existing FakeTensorMode before creating a new one.

Otherwise, we could cause an error when using `detect_fake_mode` later, because multiple FakeTensorModes would then exist.

Test Plan: The error in https://github.com/pytorch/pytorch/issues/143742

Differential Revision: D67813111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144168
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
2025-01-06 19:02:08 +00:00
e56768f030 [MPS] Fix bitwise shifts for uint8 (#144251)
Previously all bitwise operations were aliased to the same type, but this is wrong for shift ops.

Rather than building overly complex logic, let's just instantiate using the shared `scalarToMetalTypeString` helper function.

Fixes https://github.com/pytorch/pytorch/issues/144190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144251
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249, #144250
2025-01-06 18:27:16 +00:00
aa14fcd96c Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit e141cb9c34e5e96ca47ea69b565bc4fd9c8f34c1.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))
2025-01-06 18:15:52 +00:00
ebeb433e73 [BE] Fix + parametrize test_min_max_nan_propagation (#144250)
- `dtype` was not passed as argument to `torch.rand` before
- Condition bfloat16 testing on MacOS14+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144250
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249
2025-01-06 17:49:41 +00:00
11a0663eeb [BE] Parametrize test_min_max (#144249)
It's better to have one unit test per dtype rather than a combined one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144249
Approved by: https://github.com/Skylion007
2025-01-06 17:49:41 +00:00
d65a50ef34 Fix subclass unwrapping bug (#143945)
I noticed a small bug in tensor subclass unwrapping logic. cc @IvanKobzarev
It seems easier to just implement it recursively so that it is easier to track the inner attrs to their corresponding plain tensors; both aot_autograd and fake_tensor implement subclass unwrapping recursively.

Differential Revision: [D67693610](https://our.internmc.facebook.com/intern/diff/D67693610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143945
Approved by: https://github.com/IvanKobzarev
2025-01-06 17:38:19 +00:00
5c783bf410 [BE][Ez]: Update CUDNN Frontend submodule to 1.9.0 (#144200)
* Update CUDNN Frontend to 1.9.0, which includes some API improvements, new features, and bugfixes. This is a header-only lib, so the update should be pretty straightforward.
* Nicest feature is that it now logs / prints warnings when the compiled CUDNN version does not match the dynamically loaded one
* Fixes corrupted / truncated log lines from being printed by CUDNN Frontend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144200
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-06 17:33:38 +00:00
c8713e659a fix memleak, detach instead of clone to not drag around graph (#144154)
Thanks @clee2000 for bringing the memleak to my attention: https://github.com/pytorch/pytorch/actions/runs/12549765082/job/34996244798.

This memleak in the test was caused by the differentiable flavors. Because we had param.clone() and param persisted outside the for loop, the autograd graph would continue growing for each optimizer.step instead of being deleted after the optim input was used up.

To clarify, I had still expected (and still do expect) the test to fully clean everything up once the test is over, but I didn't get the chance to look into why that's not the case. This change would preliminarily unblock this particular test from failing the memleak CI.

Use detach instead of clone, which is cheaper anyway :D since, as I've learned from @soulitzer, a detach is a view with requires_grad=False
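
A small sketch of the difference (standard PyTorch behavior, not code from this PR):

```python
import torch

p = torch.randn(3, requires_grad=True)

cloned = p.clone()     # stays attached to the autograd graph (grad_fn=CloneBackward)
detached = p.detach()  # a view sharing storage, with requires_grad=False and no graph

print(cloned.requires_grad, cloned.grad_fn is not None)   # True True
print(detached.requires_grad, detached.grad_fn is None)   # False True
```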

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144154
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD
2025-01-06 17:09:00 +00:00
e222dd5d25 Rewrite _reparametrize_module to use contextmanager (#138203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203
Approved by: https://github.com/zou3519
ghstack dependencies: #136033, #140604
2025-01-06 16:56:22 +00:00
4c8d661348 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2025-01-06 16:56:22 +00:00
defbf0d339 [DTensor] Add strategy for _scaled_mm (#143760)
This is done by copying the one for a regular mm, and enforcing that the scales have the same sharding scheme as their respective operands. This works because scales are 2-d tensors that must "broadcast" to the operands. This broadcasting is trivial when scales have dimensions of 1 or N, which are the only options we currently support.

Note, however, that after this PR scales will be allowed to have the mesh's world size as a dimension (in certain cases). This works because, when mapped to the local shard, it becomes a dimension of 1, which can be handled by the operator. Note that when using row-wise _scaled_mm for tensor (sequence) parallelism, this situation arises naturally!

Because of these specificities, the test is rather complex, as it specifically tests all these behaviors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143760
Approved by: https://github.com/tianyu-l
2025-01-06 16:35:47 +00:00
d4609af1ca Propagate callable parameter types using ParamSpec (#142306) (#144047)
Fixes #142306

This PR includes typing improvements and refactoring for the following files:
- __init__.py
- decorators.py
- _ops.py

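For illustration, this is the general ParamSpec pattern these changes rely on (a standard typing feature, Python 3.10+ or `typing_extensions`; not code from the PR):

```python
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def with_logging(fn: Callable[P, R]) -> Callable[P, R]:
    # The wrapper keeps the exact parameter types of fn, so callers of the
    # decorated function still get full signature checking from the type checker.
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper
```
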
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144047
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2025-01-06 16:16:18 +00:00
cyy
9225f149eb Enable clang-analyzer checks of Clang-tidy (#144222)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144222
Approved by: https://github.com/Skylion007
2025-01-06 15:44:45 +00:00
bba672e117 [docs/export] update dynamic_shapes docs (#142510)
The dynamic_shapes section formatting at https://pytorch.org/docs/stable/export.html is messed up; fix it and update the documentation to be more user-friendly.

Happy accepting nits :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142510
Approved by: https://github.com/yushangdi
2025-01-06 14:12:34 +00:00
d85ae4be73 Update slow tests (#144236)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144236
Approved by: https://github.com/pytorchbot
2025-01-06 11:19:09 +00:00
a8e97d9d4d fix torch.acos and torch.asin for torch.complex datatypes on CPU (#134838)
Fix https://github.com/pytorch/pytorch/issues/134487, https://github.com/pytorch/pytorch/issues/138327.

These two issues are caused by the lack of special handling of the case where the real/imaginary part is 0/Inf/NaN in the vectorized implementation of `asin`. For correctness, I temporarily fall back the implementation of `asin` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134838
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-06 06:17:39 +00:00
e1622dca7a Fix duplicate pattern error (#139321)
vllm hit an error because we were incorrectly stating that two patterns are duplicates. See the comment inline:

For a particular generated pattern repr, store all the equivalent graphs that were used to generate it. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish whether two equivalent searches are actually different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321
Approved by: https://github.com/shunting314
2025-01-06 05:04:59 +00:00
cb5fa17e44 Revert "[ca] add test_dtensor_compile.py to compiled autograd tests (#144107)"
This reverts commit 67f85ccdcf56894d653b4d37cd7651eefa0ddf8c.

Reverted https://github.com/pytorch/pytorch/pull/144107 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/144107#issuecomment-2572209717))
2025-01-06 03:30:22 +00:00
c9ef98478a [mps/BE] Enable a test that now passes. (#144198)
After the implementation of floordiv in 464b50dbd7 landed, this now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144198
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-06 03:14:21 +00:00
23e2953cd3 [mps/inductor] Add support for floor(). (#144195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144195
Approved by: https://github.com/jansel
2025-01-06 02:07:17 +00:00
d71f111109 [Inductor][CPP] Fix Inductor integer avg pool (#144059)
Fixes #143738. Currently the scale factor for averaging is rounded to 0 if the dtype is an integer, resulting in an all-zero output. This fix uses `truediv` instead for integer cases.

## Test
```bash
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool1d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool2d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool3d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_local_response_norm_cpu_int64
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144059
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-01-06 01:26:53 +00:00
3d3a07963f [reland][attempt2][AMD] Turn on TF32 for aten::mm (#144145)
Summary:
https://github.com/pytorch/pytorch/pull/143549 was reverted due to some
internal/oss tooling issue. Relanding.

hipblaslt supports TF32, so adding the support.
Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67785496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144145
Approved by: https://github.com/jianyuh
2025-01-06 00:37:01 +00:00
9f94710e48 Update core.py to fix typo (#144201)
dype -> dtype

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144201
Approved by: https://github.com/Skylion007
2025-01-05 18:20:52 +00:00
51a37a42e0 [inductor][cpu] Fix bmm b_index for dynamic expressions in inductor autotuner (#143141)
Fixes #143102

Addresses 2 problems relating to dynamic batch size in BMM autotuner:
1. With dynamic batch size, the input can be a sympy Mult expression, such as `s0*8`, which occurs in many dynamo benchmark models. We address this by using `size_hints` to solve for any expressions. This is safe since this section of the code is only called to generate inputs for benchmarking.
2. Some epilogue nodes may use the dynamic batch size as part of the codegen, for example when an input to the epilogue node is transposed and has dynamic batch size in the stride of other dimensions. When these epilogue nodes exist, if the sizevar is not already present in the `kernel.args`, it will create a new sizevar with a name. It is possible that subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the calls to `def_kernel` for the GEMM functions, so no variable renaming happens later in the BMM definition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-01-05 18:02:37 +00:00
f6488d85a0 [dynamo][user-defined] Remove __getattribute__ checks and add getsetdescriptor (#144173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144173
Approved by: https://github.com/jansel
2025-01-05 13:48:15 +00:00
b01556bd8a Revert "[dynamo][dicts] Guarding lazily on dict keys (#143997)"
This reverts commit f5df082fabfe81639e25b8e01dae107548389c5e.

Reverted https://github.com/pytorch/pytorch/pull/143997 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal ci redness in some tests, D67828366 ([comment](https://github.com/pytorch/pytorch/pull/143997#issuecomment-2571587599))
2025-01-05 11:09:45 +00:00
1e881ceecf Update torch-xpu-ops commit pin (#143984)
Update the torch-xpu-ops commit to [28cfac20ec662abdb0ac98faf122450013e8f520](28cfac20ec), which includes:

- Disable batch_norm vectorization path to fix accuracy issues.
- Fix the LSTM/RNN implementation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143984
Approved by: https://github.com/EikanWang, https://github.com/ruidazeng, https://github.com/desertfire, https://github.com/jansel
2025-01-05 09:01:36 +00:00
157c185afe [inductor] Add types to compile_tasks.py and runtime_utils.py (#144004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144004
Approved by: https://github.com/yanboliang
2025-01-05 08:47:49 +00:00
67f85ccdcf [ca] add test_dtensor_compile.py to compiled autograd tests (#144107)
more than half the tests use autograd, pass rate 19/26

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel
2025-01-05 02:11:48 +00:00
f2d6cfa677 Introduce CompileEventLogger, replace usages of metrics_context and chromium_event with it (#143420)
**Problem statement**: I want to be able to centralize and simplify the process by which people add columns/data to existing spans. We have MetricsContext and ChromiumEventLogger, and there are various choices you can make to decide where and when to log different levels of observability for your events. To resolve this, I want a central API for "adding to events under dynamo_timed".

**CompileEventLogger** is intended as a frontend for MetricsContext and ChromiumEventLogger so we can use the same class for handling everything.

CompileEventLogger is intended to be used within a `dynamo_timed()` context. Its purpose is to 1. log to existing events that are in progress (i.e. within dynamo_timed), and 2. log instant events to chromium that are independent of any specific span.

CompileEventLogger has three log levels:

- CHROMIUM: Log only to chromium events, visible via tlparse.
- PT2_COMPILE: Log to chromium_events + pt2_compile_events
- COMPILATION_METRIC: Log to compilation metrics in addition to the toplevel chromium and pt2_compile_event.

In addition, we have a function CompileEventLogger.add() that automagically chooses the correct log level. For now, it is conservative, and will never automagically choose to log CompilationMetrics (though I could imagine it figuring out the metadata are all keys in CompilationMetric and therefore loggable there).

The goal here is to make one single interface to log stuff for observability reasons, and make it as easy as possible.

Not included in this diff:
- V1 of this diff will not have implementations of `increment` and `add_to_set` which MetricsContext has, so those usages are not replaced yet. But I'll add those in a followup.

- We don't handle `RuntimeMetricsContext`. It's unclear if I want that to be part of this, because under RuntimeMetricsContext there might not be a toplevel event to log to, so chromium events don't make sense in that context. So I might leave that separate for now.

Differential Revision: [D67346203](https://our.internmc.facebook.com/intern/diff/D67346203/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143420
Approved by: https://github.com/aorenste
2025-01-04 22:40:34 +00:00
68d30c6a25 Add check for unsupported sparse layout to resolve false INTERNAL ASSERT FAILED (#139198)
Fixes #131319. Implemented the check on layout as described in the original issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139198
Approved by: https://github.com/pearu, https://github.com/amjames, https://github.com/cpuhrsch

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Pearu Peterson <pearu.peterson@gmail.com>
2025-01-04 21:40:36 +00:00
b1bc880f26 [EZ][BE] Cleanup test_mps_basic (#144194)
- Sort imported tests alphabetically
- Run `add` tests with `check_lowp=False` as it is tested explicitly by parametrization
- Do not hardcode device, but rather use `self.device` property

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144194
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-01-04 21:36:40 +00:00
0dc1e6be19 [mps/BE] Fix linter warning/advice. (#144199)
Two spaces before an inline comment according to E261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144199
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 20:15:41 +00:00
e458b39fc4 c10::string_view -> std::string_view in Device.cpp (#144178)
Test Plan: Sandcastle

Differential Revision: D67817163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144178
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-04 18:51:33 +00:00
811c714911 Fix nan propagation for minimum() and maximum() in MPS (#144086)
Fixes #143976

- Moves minimum and maximum operations to use the NaN propagating call into MPSGraph instead of the default one (see the example after this list).
 - Adds test for the NaN propagating case to `test_mps.py`.
- Adjusts the inductor metal backend implementation for minimum and maximum to also respect the nan propagation.

Additions by @malfet:
 - Introduce MPSGraph+PyTorchFixups interface following [Customizing existing classes](https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ProgrammingWithObjectiveC/CustomizingExistingClasses/CustomizingExistingClasses.html) tutorial and implement `minimumWithNaNPropagationAndIntFallbackWithPrimaryTensor:` as `minimumWithNaNPropagationWithPrimaryTensor:` segfaults when called for integral types
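
A sketch of the expected behavior after the fix (requires a machine where the MPS backend is available):

```python
import torch

a = torch.tensor([1.0, float("nan")], device="mps")
b = torch.tensor([2.0, 2.0], device="mps")

# torch.minimum/maximum are documented to propagate NaN; after this fix the MPS
# backend matches that instead of silently returning the non-NaN operand.
print(torch.minimum(a, b))  # tensor([1., nan], device='mps:0')
```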

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144086
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-01-04 18:48:24 +00:00
60de73c3c7 Update nightly PyTorch version to 2.7.0
Same as https://github.com/pytorch/pytorch/pull/135916
2025-01-04 13:24:48 -05:00
f5df082fab [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158, #144163, #144160
2025-01-04 18:13:00 +00:00
005a4b9537 [Submodule] Bump Cutlass to 3.5.1 OSS PR (#144000)
## Summary
Follow up PR to https://github.com/pytorch/pytorch/pull/143515. That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built successfully. This PR actually bumps the cutlass pin to 3.5.1.

I am going to do a stack on top to add conditional gates for 3.6, hijacking the 3.4 switches. We will leapfrog our way to the top :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144000
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet
2025-01-04 18:04:03 +00:00
93633d0e80 [ROCm][Windows] Fix export macros (#144098)
For correct import and export of functions when the dynamic linkage is used for HIP libraries on windows, the appropriate export/import macros need to be put in place. This Pull Request utilizes existing CUDA import/export macros by converting them to corresponding HIP macros during the hipification process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098
Approved by: https://github.com/jeffdaily
2025-01-04 17:12:46 +00:00
45ef3309e3 [BE] typing for decorators (#144161)
Summary:
Untyped decorators strip annotations from the decorated items.

- _compile
- _inductor/fx_passes/post_grad
- _inductor/lowering
- _library/custom_ops
- _meta_registrations
- _ops
- _refs/nn/functional
- ao/quantization/quantizer/xnnpack_quantizer_utils
- distributed/_composable/contract
- fx/experimental/graph_gradual_typechecker
- fx/experimental/migrate_gradual_types/constraint_generator
- optim/optimizer
- signal/windows/windows
- testing/_internal/common_device_type
- torch/_inductor/decomposition
- utils/flop_counter

Test Plan: unit tests

Differential Revision: D62302684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-04 16:40:09 +00:00
79cbda3ab0 [ROCm] Get rid of extra rpath-link that was needed for libtinfo. (#143348)
Fixes #137858

Due to the extra rpath-link line inserted by these CMake lines, it is possible to unintentionally pick up other libraries that are incompatible with the version of ROCm in ${ROCM_PATH}.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143348
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/pruthvistony
2025-01-04 15:42:30 +00:00
6f2451c2e9 [MPS] Add aten::angle (#143449)
This adds an MPS backend implementation for `aten::angle` and `aten::angle_out` (mentioned in issue #77764), following the example #78408.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143449
Approved by: https://github.com/malfet
2025-01-04 15:38:40 +00:00
301c457032 [MPS] Fix nllnd_loss_backward crash with different dtypes (#144170)
Otherwise, invoking it with torch.half inputs but float weights will result in
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification'
/AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification'
```

Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170
Approved by: https://github.com/kit1980, https://github.com/Skylion007
ghstack dependencies: #144167, #144162, #144083, #144084
2025-01-04 15:24:55 +00:00
99f2491af9 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 45411d1fc9a2b6d2f891b6ab0ae16409719e09fc.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, @albanD please help get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2571316444))
2025-01-04 14:17:20 +00:00
cyy
df458be4e5 [4/N] Apply py39 ruff and pyupgrade fixes (#143257)
```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-01-04 10:47:51 +00:00
a881954b0c [PTD] Dump rcclexp proxy trace in pytorch (#143678)
Summary:
Dump the active proxyOp status per rank and per communicator when WatchDog times out or aborts.

Added
`#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in console.

These are the changes to PTD.

Test Plan:
Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3
 {F1971449692}

Reviewed By: c-p-i-o

Differential Revision: D67036093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678
Approved by: https://github.com/c-p-i-o
2025-01-04 10:20:47 +00:00
aa7d01ea22 Use sccache 0.9.0 on ROCm build job (#144125)
TSIA, sccache 0.9.0 seems to work fine with ROCm build job

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144125
Approved by: https://github.com/jithunnair-amd, https://github.com/wdvr, https://github.com/jeffdaily
2025-01-04 08:56:48 +00:00
636a2c7e0f [Inductor][lowering] support out_dtype for dequant lowering (#143845)
In lowering, support the parameter `out_dtype` for `dequant_per_tensor` and `dequant_per_channel`.

Fix the following runtime error issue found in https://github.com/pytorch/ao/pull/1372:

```
File "/home/liaoxuan/pytorch_ao/torch/_inductor/lowering.py", line 452, in wrapped
    out = decomp_fn(*args, **kwargs)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
LoweringException: TypeError: quantized_decomposed_dequantize_per_tensor_default() got an unexpected keyword argument 'out_dtype'
  target: quantized_decomposed.dequantize_per_tensor.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cpu', torch.uint8, size=[1, 7, 7, 9], stride=[441, 63, 9, 1]))
  ))
  args[1]: 0.01
  args[2]: 100
  args[3]: 0
  args[4]: 255
  args[5]: torch.uint8
  kwargs: {'out_dtype': torch.bfloat16}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143845
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-01-04 08:48:41 +00:00
417d9c3522 [Inductor/Triton] Upcast FP16/BF16 math reductions to FP32 (#141052)
Summary:
The Triton compiler does not automatically promote fp16/bf16 reductions to fp32 accumulation. This will result in significant accuracy issues.

This diff will upcast the input to FP32 for all math reductions `["welford_reduce", "welford_combine", "prod", "sum", "xor_sum"]`
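
A small illustration of why the upcast matters (plain PyTorch arithmetic, independent of the Triton change):

```python
import torch

# bf16 has ~8 bits of mantissa, so a bf16 running sum stops growing once the
# accumulator is much larger than the values being added.
vals = torch.ones(4096, dtype=torch.bfloat16)

acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in vals:
    acc = acc + v            # low-precision accumulation

print(acc.item())            # stalls far below 4096 (around 256)
print(vals.float().sum())    # 4096.0 when accumulating in fp32
```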

Test Plan:
CI
```
python test/inductor/test_torchinductor.py TritonCodeGenTests.test_low_precision_reduction
```

Differential Revision: D65965032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141052
Approved by: https://github.com/blaine-rister
2025-01-04 07:57:10 +00:00
816328fa51 [dynamo][lazy] LazyVT utils to get original value/source and is_hashable (#144160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144160
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158, #144163
2025-01-04 06:23:05 +00:00
b5b1e9456a [MPSInductor] Add masked implementation (#144084)
More or less borrowed from
22580f160e/torch/_inductor/codegen/halide.py (L549-L563)

`pytest test/inductor/test_torchinductor.py -k _mps` score is 408 failed, 347 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144084
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144167, #144162, #144083
2025-01-04 04:30:07 +00:00
f15af077fb Fix get_source_partitions when weights are tied (#142446)
Summary:
Fix https://github.com/pytorch/pytorch/issues/142035 and  https://github.com/pytorch/pytorch/issues/143621

When Linear module params are tied to another parameter, like this:

```
class SimpleLinearModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLinearModel, self).__init__()
        # Define a linear layer
        self.linear = nn.Linear(input_size, output_size)
        self.tied_weight = self.linear.weight

    def forward(self, x):
        # Forward pass through the linear layer
        b = self.tied_weight + 1
        return self.linear(x), b
```

We get a graph like below:

```
graph():
    %p_tied_weight : [num_users=0] = placeholder[target=p_tied_weight]
    %p_linear_weight : [num_users=2] = placeholder[target=p_linear_weight]
    %p_linear_bias : [num_users=1] = placeholder[target=p_linear_bias]
    %x : [num_users=1] = placeholder[target=x]
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%p_linear_weight, 1), kwargs = {})
    %linear : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%x, %p_linear_weight, %p_linear_bias), kwargs = {})
    return (linear, add)
```

Notice that ` %p_linear_weight : [num_users=2]`.

When we get source partitions, we should exclude attribute nodes like `p_linear_weight` from outputs.
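
A hypothetical check sketch reusing the `SimpleLinearModel` defined above; the import path for `get_source_partitions` and the exact printed structure are assumptions:

```python
import torch
from torch.fx.passes.utils.source_matcher_utils import get_source_partitions  # assumed path

model = SimpleLinearModel(4, 2)
ep = torch.export.export(model, (torch.randn(1, 4),))
partitions = get_source_partitions(ep.graph_module.graph, [torch.nn.Linear])
# After the fix, parameter placeholders such as p_linear_weight are no longer
# reported among the Linear partition's output nodes.
print(partitions)
```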

A real world example where people do something like this is in https://github.com/pytorch/pytorch/issues/142035.

Test Plan:
```
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r test_module_partitioner_weight_tied
```

Differential Revision: D66998592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142446
Approved by: https://github.com/angelayi
2025-01-04 04:28:20 +00:00
cyy
f9bf9057ef Fix ruff warnings in caffe2 and functorch (#144182)
In preparation for upgrading ruff config to py3.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144182
Approved by: https://github.com/malfet
2025-01-04 04:15:01 +00:00
ec1f56fdcf [user triton] add support for prune_configs_by in @triton.autotune (#142207)
This PR adds support for prune_configs_by in the @triton.autotune decorator [docs](https://triton-lang.org/main/python-api/generated/triton.autotune.html#triton.autotune). Supporting this lets users reduce autotuning time by running user-supplied code (early_config_prune, perf_model) to prune the provided list of configs.

We implement this by realizing args/kwargs in call_triton_kernel(...), and then calling kernel.prune_configs(...).
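
A small sketch of the decorator feature being supported here, using the standard Triton API; the exact `early_config_prune` callback signature can vary across Triton versions, so treat the details as illustrative:

```python
import triton
import triton.language as tl

def early_config_prune(configs, named_args, **kwargs):
    # Illustrative pruning rule: keep configs whose BLOCK divides the problem size.
    n = named_args["n_elements"]
    pruned = [c for c in configs if n % c.kwargs["BLOCK"] == 0]
    return pruned or configs

@triton.autotune(
    configs=[triton.Config({"BLOCK": b}) for b in (64, 128, 256)],
    key=["n_elements"],
    prune_configs_by={"early_config_prune": early_config_prune},
)
@triton.jit
def copy_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    tl.store(y_ptr + offsets, tl.load(x_ptr + offsets, mask=mask), mask=mask)
```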

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142207
Approved by: https://github.com/zou3519, https://github.com/aakhundov
2025-01-04 03:50:28 +00:00
479d6f2199 [mps/inductor] Add support for log(). (#144169)
Tested via:

```
 % pytest test/inductor/test_mps_basic.py
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144169
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-04 03:07:56 +00:00
087c625261 [dynamo] Trace torch.typename (#144163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144163
Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158
2025-01-04 02:52:58 +00:00
3292220c43 [dynamo][easy] Move symnode helpers to utils (#144158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144158
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141
2025-01-04 02:52:58 +00:00
98949df7a4 Fix torch.distributed._functional_collectives.AsyncCollectiveTensor for aten.to. (#134661)
Fixes #133421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134661
Approved by: https://github.com/bdhirsh
2025-01-04 02:33:38 +00:00
eqy
7e3cd0e488 [CUDA] Check size calculation in ilpReduce for softmax (#144009)
For #143644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144009
Approved by: https://github.com/Skylion007
2025-01-04 02:31:15 +00:00
eqy
dbdda654af [64-bit][CUDA] Upsample2D 64-bit indexing fix attempt 2 (#141923)
#141831
Block/thread math requires a cast...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141923
Approved by: https://github.com/ngimel
2025-01-04 02:30:38 +00:00
1d091e47d6 [Inductor UT] Generalize device-bias code in test_torchinductor.py introduced by #143884. (#144057)
Fix #144056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144057
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-01-04 02:24:33 +00:00
22580f160e Multinomial sampling fix on mps for non contiguous tensors (#141515)
Fixes #141457

As for the tests: I looked in `test/test_mps.py` but saw that the `test_multinomial` function is disabled. Glad to add a test where needed if there is some place where the multinomial function is tested on Metal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141515
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 01:21:37 +00:00
464b50dbd7 [MPSInductor] Add floor_div and index_expr implementation (#144083)
Simply copy-n-pasted from CPPInductor

`pytest test/inductor/test_torchinductor.py -k _mps` score is 418 failed, 337 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144083
Approved by: https://github.com/jansel
ghstack dependencies: #144167, #144162
2025-01-04 01:10:01 +00:00
6d25938540 [MPSInductor] Add remainder op (#144162)
For it to return the correct result for half-precision types, it must be upcast to float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144162
Approved by: https://github.com/jansel
ghstack dependencies: #144167
2025-01-04 00:47:40 +00:00
f8e1eacf2f [MPSInductor] Extend constant to bool type (#144167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144167
Approved by: https://github.com/jansel
2025-01-04 00:47:40 +00:00
d41134f7e5 [Inductor] Fix torch.polygamma() when n == 0 (#144058)
Fixes #143648

aten:

dec1a6d0f0/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L436-L447)

compiled kernel code:

```
cpp_fused_polygamma_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_devuser/tmpi1d9ksww/db/cdb7hyptwxpzukwd42x4ajfjlgrpum4a4htdd6lhb65apclsmno4.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        {
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(0.0);
                auto tmp2 = tmp1 == 0 ? calc_digamma(tmp0) : calc_polygamma(tmp0, tmp1);
                out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
''')
```
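
For reference, the behavior the generated kernel reproduces (polygamma of order 0 is the digamma function):

```python
import torch

x = torch.tensor([0.25, 1.5, 3.0])
torch.testing.assert_close(torch.polygamma(0, x), torch.digamma(x))
```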

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144058
Approved by: https://github.com/jansel
2025-01-04 00:22:10 +00:00
52742b07c5 remove allow-untyped-defs from nn/utils/_deprecation_utils.py (#144136)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144136
Approved by: https://github.com/aorenste
2025-01-03 23:44:14 +00:00
0a94bb432e [ROCm] CK Flash Attention Backend (#143695)
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.

Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
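
A usage sketch based on the description above; it assumes a ROCm build with USE_CK_FLASH_ATTENTION=1 and that the accepted strings are exactly those named in this PR text:

```python
import torch

# Force the composable_kernel flash-attention path (ROCm builds only).
torch.backends.cuda.preferred_rocm_fa_library("ck")

# Or explicitly keep the incumbent aotriton implementation.
torch.backends.cuda.preferred_rocm_fa_library("aotriton")
```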

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-03 22:01:36 +00:00
3251171ae8 Make whl metadata public readable (#144164)
After https://github.com/pytorch/pytorch/pull/143677/files#r1902138480 landed, the new nightly wheel metadata is not publicly readable, causing pip install to fail; see for example https://github.com/pytorch/pytorch/actions/runs/12603415308/job/35128414909.

FBGEMM folks are also noticed this failure on their end (cc @q10)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144164
Approved by: https://github.com/clee2000
2025-01-03 21:08:49 +00:00
9bf2a9a616 [ScaledMM] Fix NaNs in test for garbage input data (#144042)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144042
Approved by: https://github.com/janeyx99
2025-01-03 21:02:25 +00:00
b75f32b848 Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139)
Address related comments earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144139
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-01-03 20:41:36 +00:00
64bffb3124 remove allow-untyped-defs onnx/_internal/exporter/_fx_passes.py (#144134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144134
Approved by: https://github.com/Skylion007
2025-01-03 20:18:40 +00:00
64b197b603 remove allow-untyped-defs from export/_remove_auto_functionalized_pass.py (#144135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144135
Approved by: https://github.com/Skylion007
2025-01-03 20:08:11 +00:00
9b8a4e7141 remove allow-untyped-defs from torch/onnx/operators.py (#144133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144133
Approved by: https://github.com/Skylion007
2025-01-03 20:07:56 +00:00
6e09d32c00 remove allow-untyped-defs from torch/jit/_passes/_property_propagation.py (#144132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144132
Approved by: https://github.com/Skylion007
2025-01-03 20:07:37 +00:00
eb7a303d21 [dtensor] expose the __create_chunk_list__ in the doc (#144100)
as titled, this PR exposes this dunder method as a public API in the doc, so that different checkpoint implementations can leverage this protocol instead of exposing a separate API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100
Approved by: https://github.com/awgu
ghstack dependencies: #144099
2025-01-03 20:06:23 +00:00
45411d1fc9 Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory (see the sketch below for the difference).
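
A small sketch of the difference between the two (standard pathlib behavior):

```python
from pathlib import Path

p = Path("repo/../pytorch/setup.py")
print(p.absolute())  # prepends the CWD but keeps ".." components and symlinks as-is
print(p.resolve())   # also normalizes ".." and resolves symlinks to a canonical path
```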

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2025-01-03 20:03:40 +00:00
e9e18a9617 remove allow-untyped-defs from _export/db/logging.py (#144093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144093
Approved by: https://github.com/Skylion007
2025-01-03 19:36:14 +00:00
ad09395674 [MPSInductor] Fix multi rangevar kernel invocation (#144050)
By changing `thread_position_in_grid` type to uint{n} and passing
dimensions during the kernel call

`pytest test/inductor/test_torchinductor.py -k _mps` score is 445 failed, 309 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144050
Approved by: https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105, #144156
2025-01-03 19:32:43 +00:00
52e107a7ca [MPSInductor] Add constant, isinf and isnan ops (#144156)
Per Table 6.5 of [Metal Language Specification](https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf) infinity is `HUGE_VALF`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144156
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105
2025-01-03 19:32:43 +00:00
383ff4011c [ez] Use strip for arg sanitization in upload_metadata_file to improve readability (#144155)
Minor thing that improves readability.  I didn't realize you could specify characters for strip when I wrote this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144155
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-03 19:25:30 +00:00
8b3479e361 remove allow-untyped-defs from torch/distributed/fsdp/_dynamo_utils.py (#144131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144131
Approved by: https://github.com/Skylion007
2025-01-03 19:07:21 +00:00
7b69f7b449 Clarify what we mean by decoupled weight decay in the *AdamWs (#144101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144101
Approved by: https://github.com/albanD
2025-01-03 19:06:00 +00:00
c36f94b373 [while_loop][dynamo] auto-unspecialize int input and output to unbacked symints (#143106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143106
Approved by: https://github.com/zou3519
ghstack dependencies: #143105, #143545
2025-01-03 19:01:07 +00:00
5660709856 [hop][BE] unify meta checking with check_meta_consistency (#143545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143545
Approved by: https://github.com/zou3519
ghstack dependencies: #143105
2025-01-03 19:01:07 +00:00
6e8dca9ff3 [while_loop][aot] auto-unspecialize int input and output to unbacked symints (#143105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143105
Approved by: https://github.com/zou3519
2025-01-03 19:01:07 +00:00
56f6289f6a [mps/inductor] Add support for atanh(). (#144121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144121
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-03 18:55:05 +00:00
a7b61c5b49 [MPSInductor] Add signbit op support (#144105)
By mapping it to `metal::signbit`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144105
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #144055, #144051, #144122
2025-01-03 18:34:46 +00:00
8d63a4a409 Revert "Set enable_trace_contextlib_contextmanager flag to True (#140604)"
This reverts commit 1c817fe6714cec510ccc6022b2c3e66146c3ad59.

Reverted https://github.com/pytorch/pytorch/pull/140604 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/140604#issuecomment-2569640837))
2025-01-03 18:23:53 +00:00
c5c897c3a1 [dynamo][easy] Miscellaneous fixes (#144141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144141
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129, #144130
2025-01-03 18:22:56 +00:00
732359c633 [dynamo][easy] Minor fixes in guards.cpp (#144130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144130
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129
2025-01-03 18:22:56 +00:00
a450e177fd [dynamo] remove inline inbuilt tests as flag is enabled by default (#144129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144129
Approved by: https://github.com/williamwen42
2025-01-03 18:22:56 +00:00
2409b49a33 Revert "Rewrite _reparametrize_module to use contextmanager (#138203)"
This reverts commit 7bf3b7cdc5631f9991eebcdd8ec09095339a9973.

Reverted https://github.com/pytorch/pytorch/pull/138203 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/138203#issuecomment-2569634001))
2025-01-03 18:17:32 +00:00
60fe8a65af [Inductor] Generalize tiling algorithm to handle fused reductions (#144041)
# Issue

This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.

# Fix

Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.

Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with  a node of ranges `([8], [])` when we have `reduction_numel=8`.

Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.

# Test plan

This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041
Approved by: https://github.com/jansel
2025-01-03 18:16:27 +00:00
e93f625d00 [AOTI] don't codegen autotune_at_compile_time for non-Triton kernels (#143990)
`autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS).

This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI.

```
// in AOTI codegen
kernels.cuda_fused_0(
  (const half*)arg0_1.data_ptr(), (const half*)arg1_1.data_ptr(), (half*)buf0.data_ptr(),
  (int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216,
  (size_t*)nullptr, (uint8_t*)workspace_0.data_ptr(), stream);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990
Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire
2025-01-03 18:01:12 +00:00
f3968373c1 Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118)
CUDA 12.4 is the default now and we don't build nightly 12.1 anymore, so it's time to move the rest of CI jobs to 12.4.  I also clean up some redundant CI jobs on periodic and inductor-periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144118
Approved by: https://github.com/atalman
2025-01-03 17:45:41 +00:00
cbdc70ae07 Use the build environment as sccache prefix instead of workflow name (#144112)
This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor as we are seeing build timeout there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804.  The build timeout never happens in pull or trunk AFAICT because they are more up to date with the cache content coming from the PR itself.

Logically, the same build should use the same cache regardless of the workflows.  We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches.

I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112
Approved by: https://github.com/malfet
2025-01-03 17:33:03 +00:00
b9fbd65dfd AOTI fallback ops: remove ops that were never codegen'ed (#143421)
Removes 4 fallback ops that are currently not possible to codegen, which does not break ABI-compatibility.

1. `_cudnn_rnn_backward` and `_histogramdd_bin_edges` both return `Tensor[]`, which we cannot codegen with the current design.
2. `_sparse_coo_tensor_with_dims_and_tensors` only supplies a Sparse operator, which we don't support.
3. `zeros.names` requires a `Dimname` input, which we can't currently codegen.

Removing these ops from the list will improve test performance, since the fallback op generation will use the Python proxy executor instead of calling non-existent C functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143421
Approved by: https://github.com/desertfire
ghstack dependencies: #141371, #143223
2025-01-03 16:05:38 +00:00
b5b419d627 cpp_wrapper: Use runtime dispatched fallbacks for complex ops (#143223)
When calling a fallback op in cpp_wrapper mode, where any of the inputs are complex numbers, utilize the runtime dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic.

This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223
Approved by: https://github.com/desertfire
ghstack dependencies: #141371
2025-01-03 16:05:38 +00:00
e88d06f54e ir.ExternKernel: correctly handle kwarg default arguments (#141371)
Additionally, enable torchinductor opinfo tests exercising all
previously fixed bugs in this stack.

Note: I've manually sharded the cpp_wrapper CI checks into 2 shards.
Once all OpInfo tests are enabled we should switch back to automatic
sharding, but until then the pipeline doesn't have appropriate timing
stats.  More shards would be helpful given the compilation slowdown
associated with cpp_wrapper, but 2 will do for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371
Approved by: https://github.com/desertfire
2025-01-03 16:05:31 +00:00
f7644efa79 [MPSInductor][EZ] Fix logical_[or|and] ops (#144122)
For boolean operands it does not really matter whether `&` or `&&` is used, but if one were ever to rely on operator precedence, then bitwise ops should have higher precedence than logical ones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144122
Approved by: https://github.com/huydhn
ghstack dependencies: #144055, #144051
2025-01-03 15:28:07 +00:00
b336d72dae [MPSInductor] Preserve dtype during load (#144051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144051
Approved by: https://github.com/Skylion007
ghstack dependencies: #144055
2025-01-03 15:17:33 +00:00
a1ae8fadc7 [cpu][vec] support reduce ops for add and max (#144065)
### Description

During the support of INT8 SDPA https://github.com/pytorch/ao/pull/1372, we found that `at::vec::vec_reduce_all<int32_t>` would go into a slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using the Sequence instructions.

### Details
- Support vectorized `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using the Sequence instructions;
- Implement the scalar version for fallback path in vec base;
- Add the operator `reduce` in vec base, in order to simplify the codes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144065
Approved by: https://github.com/mingfeima
2025-01-03 13:01:52 +00:00
55dc61dd52 Dataloader distribute tasks to workers when in_order is False (#142324)
Fixes #105203 and is a follow up PR to #141833

When `in_order` is True (the default), tasks are given out to workers in a round robin fashion. When `in_order` is False this is no longer needed, as we give up guarantees of reproducibility, and instead tasks should be given to workers that are able to perform work.
In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to their queue, and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False it will only add the task to the worker's queue if it has fewer than `_prefetch_factor` tasks outstanding.
The current default behaviour is left as is.
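
A minimal usage sketch of the flag being changed here; the dataset and sizes are made up for illustration:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TinyDataset(Dataset):
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return torch.tensor(idx)

# With in_order=False, workers that finish early can pull more tasks (up to
# prefetch_factor outstanding each), at the cost of reproducible ordering.
loader = DataLoader(TinyDataset(), num_workers=2, prefetch_factor=2, in_order=False)
for item in loader:
    pass
```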

Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky
```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142324
Approved by: https://github.com/andrewkho
2025-01-03 12:57:04 +00:00
c09bf71bd6 [Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848)
Fix https://github.com/pytorch/pytorch/issues/143568
Before:
![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003)
After:
![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-01-03 09:00:43 +00:00
d9507548d8 [dynamo][BE] move zip_longest polyfill to submodule polyfills.itertools (#144067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144067
Approved by: https://github.com/yanboliang
ghstack dependencies: #144066
2025-01-03 08:08:31 +00:00
fb1beb31d2 [dynamo][BE] move dropwhile polyfill to submodule polyfills.itertools (#144066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144066
Approved by: https://github.com/jansel
2025-01-03 08:08:31 +00:00
00df63f09f [ROCm] Fix for ld failed to convert GOTPCREL relocation in PyTorch build (#143986)
I experienced an error while doing a DEBUG build of pytorch on rocm:
```
additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
```
Based on discussions in the similar issue #138427, I fixed it by adding `--offload-compress` to HIP_HIPCC_FLAGS, which let the DEBUG build succeed on my node.

Further updated the PR to enable the flag for non-DEBUG builds as well due to the size reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143986
Approved by: https://github.com/jeffdaily
2025-01-03 06:53:08 +00:00
e141cb9c34 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2025-01-03 05:41:06 +00:00
48a05ee773 [dtensor] improve doc of the DTensor class (#144099)
as titled: explicitly list all public members to make sure the public API stays consistent; also use groupwise member order to make the doc look better.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099
Approved by: https://github.com/awgu
2025-01-03 05:35:44 +00:00
41b5c600df [ReduceOps] Add dimension checking for cummin()/cummax(). (#143920)
Summary: cum{min,max} didn't guard against 0-d tensors and allowed an arbitrary dimension to be passed.

Test Plan: torch_test.py

Fixes #71477

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143920
Approved by: https://github.com/malfet
2025-01-03 04:14:33 +00:00
c5b75f8db1 [AOTI] Remove more AOTI_TORCH_EXPORT (#144081)
Summary: Similar to https://github.com/pytorch/pytorch/pull/142500, remove redundant AOTI_TORCH_EXPORT from several cpp files, to solve a windows build issue.

Differential Revision: D67762069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144081
Approved by: https://github.com/yushangdi
2025-01-03 02:17:38 +00:00
c31912666e [ROCm] Print amdgpu info on bare metal for CI runners (#144038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144038
Approved by: https://github.com/jeffdaily
2025-01-03 02:00:40 +00:00
37e9da0687 [ROCm][Windows] Disable roctracer-related code (#143329)
Currently, the roctracer is not available on Windows. This PR disables any mention of its usage on Windows and creates dummy functions to keep compatibility with existing code; these warn the user that roctracer is not available on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
2025-01-03 01:51:01 +00:00
891a86d1ad remove allow-untyped-defs from ao/quantization/experimental/fake_quantize.py (#144091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144091
Approved by: https://github.com/aorenste
2025-01-03 01:26:36 +00:00
377e29745f remove allow-untyped-defs from distributed/elastic/utils/data/cycling_iterator.py (#144090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144090
Approved by: https://github.com/aorenste
2025-01-03 01:22:50 +00:00
0d6db839a7 remove allow-untyped-defs from utils/data/datapipes/iter/streamreader.py (#144088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144088
Approved by: https://github.com/aorenste
2025-01-03 01:21:44 +00:00
bdfb40ed29 remove allow-untyped-defs from utils/_import_utils.py (#144089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144089
Approved by: https://github.com/aorenste
2025-01-03 01:21:13 +00:00
28a74fe3aa remove allow-untyped-defs from torch/mps/event.py (#144092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144092
Approved by: https://github.com/aorenste
2025-01-03 01:20:17 +00:00
496fc90965 [CI] Multigpu 1 -> 2 shards (#143992)
Fixes #ISSUE_NUMBER
It's been timing out https://github.com/pytorch/pytorch/actions/runs/12544191739/job/34977636276

They're still somewhat uneven but they're both under the limit now.  It would probably be better to use run_test.py's sharding to do this, maybe in another PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143992
Approved by: https://github.com/huydhn
2025-01-03 00:33:16 +00:00
3eb3f4ed55 Upload METADATA file with whl binaries (#143677)
Upload the metadata file for wheels per PEP 658: https://peps.python.org/pep-0658/
This uses a Python script, but using bash might be easier...

--

Testing

Example run https://github.com/pytorch/pytorch/actions/runs/12550595201/job/34994883276 without actual upload, just dry run

Lightly tested the script to make sure it uploads to s3, but integration with the bash script + workflow is untested

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143677
Approved by: https://github.com/seemethere
2025-01-03 00:32:05 +00:00
bb5e439f2d Add networkx as bazel dep to fix CI failure (#143995)
Add networkx as a dependency for test_bazel

Example failure: https://github.com/pytorch/pytorch/actions/runs/12551752021/job/34996706301

```

INFO: From Testing //:test_bazel:
==================== Test output for //:test_bazel:
Traceback (most recent call last):
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 33, in <module>
    test_simple_compile_eager()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 27, in test_simple_compile_eager
    opt_foo1 = torch.compile(foo, backend="eager")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2533, in compile
    backend = _TorchCompileWrapper(backend, mode, options, dynamic)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2342, in __init__
    self.compiler_fn = lookup_backend(backend)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 66, in lookup_backend
    _lazy_import()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 102, in _lazy_import
    import_submodule(backends)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/utils.py", line 2797, in import_submodule
    importlib.import_module(f"{mod.__name__}.{filename[:-3]}")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/execroot/pytorch/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/common.py", line 12, in <module>
    from torch._functorch.aot_autograd import (
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/aot_autograd.py", line 147, in <module>
    from .partitioners import default_partition
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/partitioners.py", line 31, in <module>
    from ._activation_checkpointing.graph_info_provider import GraphInfoProvider
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module>
    import networkx as nx
ModuleNotFoundError: No module named 'networkx'
```

No periodic runs on this PR or its main branch commit, but I'm pretty sure it started with https://togithub.com/pytorch/pytorch/pull/143539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143995
Approved by: https://github.com/huydhn
2025-01-02 19:42:18 +00:00
a8c98ce175 [cutlass-3] Update third-party/cutlass-3 from 3.4 to 3.5.1 (#143515)
# Summary:

This also makes updates to different repositories throughout FB code to roll any updates needed for this new release.

I was not able to get AsyncMM.cu to build (still trying); Yfiu suggested that I just skip it for now.

Test Plan:
Have run various build commands to try and expose errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143515
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-02 18:45:11 +00:00
8506a2af9a remove allow-untyped-defs from _export/pass_infra/proxy_value.py (#143944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143944
Approved by: https://github.com/aorenste
ghstack dependencies: #143943
2025-01-02 18:17:03 +00:00
8f3eb84373 ROCm: Enable 4 gpu tests for distributed config (#140319)
Change the label to make sure the jobs land on a
node which has >= 4 GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140319
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/kwen2501
2025-01-02 17:22:11 +00:00
916b510ff5 Enable mkldnn pattern matcher tests for BF16 on AArch64 (#144030)
Fixes #143146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144030
Approved by: https://github.com/malfet
2025-01-02 17:13:38 +00:00
a93e75d1e2 [MPS] Handle implicit cpu-scalar-to-gpu transfer (#144055)
Follow-up to https://github.com/pytorch/pytorch/pull/143934: this check is no longer necessary, and removing it fixes a subset of inductor tests.

Before this change, `pytest test/inductor/test_torchinductor.py -k _mps` reported 463 failed, 291 passed, 32 skipped; after it, 456 failed, 298 passed, 32 skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144055
Approved by: https://github.com/Skylion007
2025-01-02 17:12:39 +00:00
0431d47eaa [tp] propagate src_data_rank kwarg in TP API (#144005)
As titled, this PR propagates the src_data_rank kwarg in the TP API, so that
module-level APIs can choose the source data rank and skip the communication
when it is not needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144005
Approved by: https://github.com/tianyu-l
ghstack dependencies: #143883
2025-01-02 05:35:52 +00:00
f242dbb76f [dtensor] add src_data_rank to distribute_tensor API (#143883)
As titled, this PR adds a kwarg src_data_rank to the distribute_tensor
API, allowing users to specify a particular rank as the source of the full tensor
data. Previously we used group_rank=0 by default as the source of
truth for single-device semantics; this new option:

* gives advanced users the flexibility to choose the source data rank
* allows users to specify None explicitly, which means we will skip the
  communications needed (scatter/broadcast) for cases that do not
care about single-device semantics (e.g. loading from a checkpoint), as in the sketch below
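
A minimal sketch of the new kwarg, assuming a two-process gloo run (e.g. via `torchrun --nproc_per_node=2`); the mesh shape and tensor sizes are illustrative, not taken from the PR:

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

full = torch.randn(8, 8)

# Default behavior: rank 0 is the source of truth, and its data is scattered/broadcast.
dt_default = distribute_tensor(full, mesh, [Shard(0)])

# New option: pick a different source rank, or pass None to skip the communication
# entirely (e.g. when each rank will load its own shard from a checkpoint anyway).
dt_from_rank1 = distribute_tensor(full, mesh, [Shard(0)], src_data_rank=1)
dt_no_comm = distribute_tensor(full, mesh, [Shard(0)], src_data_rank=None)

dist.destroy_process_group()
```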

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143883
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-01-02 05:35:52 +00:00
dec1a6d0f0 [dynamo] Separate out GetItemSource and DictGetItemSource (#143926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143926
Approved by: https://github.com/jansel
2025-01-01 02:39:41 +00:00
8d9ff9c8a4 Fix a bug for wrong stride in fake tensor (#141427)
Fixes #141426

Please see details in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141427
Approved by: https://github.com/jansel
2024-12-31 23:45:32 +00:00
e7ed660233 [inductor] Add missing py312 xfail (#144006)
See #144006

```py
__________________________________________ CudaReproTests.test_repeated_masked_load __________________________________________
RuntimeError: First class dim doesn't work with python 3.12

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/jansel/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/home/jansel/pytorch/test/inductor/test_cuda_repro.py", line 1678, in test_repeated_masked_load
    from functorch.einops import rearrange
  File "/home/jansel/pytorch/functorch/einops/__init__.py", line 1, in <module>
    from .rearrange import rearrange
  File "/home/jansel/pytorch/functorch/einops/rearrange.py", line 7, in <module>
    from functorch._C import dim as _C
ImportError: initialization failed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144006
Approved by: https://github.com/Skylion007
2024-12-31 23:37:05 +00:00
a174ee2255 Revert "Fix duplicate pattern error (#139321)"
This reverts commit 9e8d84f8631317ce61de4f0f9731fc1b1ccc3d2b.

Reverted https://github.com/pytorch/pytorch/pull/139321 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/139321#issuecomment-2566620095))
2024-12-31 17:44:02 +00:00
d8a2796fb6 Revert "[Inductor UT] Generalize newly introduced device-bias hard code in (#143975)"
This reverts commit 7c1c0730beed9bb05a16ba678a8f32b29fdd0a29.

Reverted https://github.com/pytorch/pytorch/pull/143975 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/139321 feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/143975#issuecomment-2566619312))
2024-12-31 17:41:06 +00:00
eec30916e7 Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit 135a2d44830b2de1ed6714f52cc6a612406adb6d.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2566615835))
2024-12-31 17:35:32 +00:00
5ef0de7615 [MPSInductor] Fix multiple kernel generation (#143998)
At the moment by generating multiple MetalLibraries

`pytest test/inductor/test_torchinductor.py -k _mps` score is 434 failed, 317 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143998
Approved by: https://github.com/jansel, https://github.com/ruidazeng
ghstack dependencies: #143948, #143949, #143973, #143977
2024-12-31 13:51:50 +00:00
f0f09bb3c2 [MPSInductor] Implement minimum and maximum ops (#143977)
By calling `metal::min` and `metal::max` respectively with argument
typecast to a common type to avoid ambiguous calls errors

TODO: Implement NaN propagation for both eager and compile, see https://github.com/pytorch/pytorch/issues/143976

`pytest test/inductor/test_torchinductor.py -k _mps` score is 460 failed, 291 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143977
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949, #143973
2024-12-31 13:51:50 +00:00
09e47ab7ab Refine CUDA Stream priority (#143849)
# Motivation
As mentioned in https://github.com/pytorch/pytorch/pull/141119#discussion_r1897480515, we properly handle the priority value if it is outside of the priority range.

# Additional Context
If the value falls outside of the allowed priority range, it will automatically be mapped to the nearest valid priority (either lowest or highest), as in the sketch below.
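
A minimal sketch of the clamping behavior on a CUDA build; the out-of-range value below is arbitrary, and the exact priority range is device-dependent:

```python
import torch

if torch.cuda.is_available():
    s_default = torch.cuda.Stream()          # default (lowest) priority
    s_high = torch.cuda.Stream(priority=-1)  # higher priority
    # Per the change above, a value outside the allowed range is mapped to the
    # nearest valid priority (lowest or highest) rather than being used as-is.
    s_clamped = torch.cuda.Stream(priority=100)
    print(s_default.priority, s_high.priority, s_clamped.priority)
```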

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143849
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123, #143799
2024-12-31 11:15:59 +00:00
3848de55ed Add get_stream_from_external API for CUDA backend (#143799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123
2024-12-31 11:15:59 +00:00
8f6c4d1732 Add get_stream_from_external API for XPU backend (#141123)
# Motivation
This PR aims to introduce `torch.xpu.ExternalStream`, which wraps a SYCL queue created in other libraries for use in PyTorch.

# Additional Context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119
2024-12-31 11:15:52 +00:00
a68c0ca497 Add low priority XPU Stream (#141119)
# Motivation
Due to the potential for the external SYCL queue to have a low priority, we need to support the low-priority SYCL queue for native XPU Streams to maintain consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141119
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #142347
2024-12-31 11:15:45 +00:00
39450ae655 Refine XPU external Stream (#142347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142347
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-12-31 11:15:38 +00:00
16a57e232c removed dead code for dynamo flag dead_code_elimination (#140938)
Fixes #136862

1.  removed dead code from torch/_dynamo/convert_frame.py
2.  ran `lintrunner -a` and all the tests passed.
3. ran the unit tests and everything seems to be in order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140938
Approved by: https://github.com/zou3519
2024-12-31 09:27:43 +00:00
01034e963c [AOTI] Not use AOTI_TORCH_CHECK in non AOTI mode. (#143970)
Fix #143967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143970
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-31 06:28:32 +00:00
a2753e376b [Inductor] Support tiling reduction dimensions (#137243)
Fixes #134277 and https://github.com/pytorch/pytorch/issues/142317.

Sub-PRs containing refactors from this one:
 - https://github.com/pytorch/pytorch/pull/141733
 - https://github.com/pytorch/pytorch/pull/141738
 - https://github.com/pytorch/pytorch/pull/141751 (based off the former)
 - https://github.com/pytorch/pytorch/pull/142249
 - https://github.com/pytorch/pytorch/pull/142020
 - https://github.com/pytorch/pytorch/pull/143135

 These refactor PRs should land before the main one.

# Feature

*Note: to minimize risk, multi-dimensional reductions are gated by the flag `config.triton.tile_reductions`, which defaults to False.*

Instead of having a single reduction dimension called `"r"`, we can now support 2D reductions with `"r0_"` and `"r1_"` dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D.

Here's an example of a 2D persistent reduction kernel:
```
@triton.jit
def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr):
    xnumel = 1
    r0_numel = 15
    R0_BLOCK: tl.constexpr = 16
    r1_numel = 15
    R1_BLOCK: tl.constexpr = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_index = tl.arange(0, R0_BLOCK)[None, :, None]
    r0_offset = 0
    r0_mask = r0_index < r0_numel
    r1_index = tl.arange(0, R1_BLOCK)[None, None, :]
    r1_offset = 0
    r1_mask = r1_index < r1_numel
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    roffset = r1_offset + (r0_offset*r1_numel)
    rindex = r1_index + (r0_index*r1_numel)
    r0_0 = r0_index
    r1_1 = r1_index
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :]
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
    tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0)
    tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])
    tmp5 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None)
''', device_str='cuda')
```

There are a few main differences between this kernel and what Inductor would generate without this PR.
 - Instead of an `r`/`RBLOCK` dimension, we have two reduction dimensions: `r0_`/`R0_BLOCK` and `r1_`/`R1_BLOCK`.
 - There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension. (`rindex`, `rnumel`, `RBLOCK`, and `roffset`.) These collapse N-D reduction sizes and indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. Although this makes the code more verbose, it shouldn't have a perf impact because the triton compiler eliminates dead code.
 - We generate the line `tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])` before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton compiler to decide the order of processing under the hood.

Here's an example of a looped reduction:
```
@triton.jit
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 3
    r0_numel = 43
    r1_numel = 129
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    rbase = r1_base + (r0_base*r1_numel)
    x0 = xindex
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + (r0_offset*r1_numel)
            rindex = r1_index + (r0_index*r1_numel)
            r0_1 = r0_index
            r1_2 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first')
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 + tmp1
            _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)*R1_BLOCK*((128 + R1_BLOCK) // R1_BLOCK)])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code:
 - They calculate indices inside the loop using `r0_base` and `r1_base`. For compatibility with existing codegen, these are collapsed to the 1D variant `rbase`.
 - Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a `tl.advance` line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new `self.pointer_advancements` field of the kernel, which categorizes advancements by dimension.

The biggest difficulty in implementing this feature was that we represented tiling with a tuple like `(5,2)`. In the existing codebase, the compiler can infer that the reduction dimension of `(5,2)` is `2`, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like `{"x": 5, "r0_": 2, "r1_": 4}`. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like `{"x": 5, "y": 5, "r0_": 2, "r1_": 4}`. (This is not supported today, but we might want to do it eventually.)

The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions (`"x"`, `"y"`) as is. For reduction kernels, we never tile the `"x"` dimension, and only tile the reduction dimensions (`"r0_"`, `"r1_"`). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files.

Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for `argmax` and `var_mean`, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing.

To address these cases, this PR adds a new feature to the `config.prefer_nd_tiling` option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which *would* simplify the indexing expression, and then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible. A sketch of opting into these flags follows.
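
A minimal sketch of opting in, assuming both flags live under `torch._inductor.config.triton` (the text above refers to `config.triton.tile_reductions` and `config.prefer_nd_tiling`, so the exact paths are an assumption); illustrative only, not code from the PR:

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.tile_reductions = True   # gate for multi-dimensional reductions
inductor_config.triton.prefer_nd_tiling = True  # tiling mode that targets block pointers

def sum_last_two_dims(x):
    # Reduces over two trailing dimensions, a candidate for an r0_/r1_ 2D reduction.
    return x.sum(dim=(-2, -1))

compiled = torch.compile(sum_last_two_dims)
out = compiled(torch.randn(3, 43, 129))
```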

# Test plan

This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like `config.prefer_nd_tiling` and `config.tile_reductions`, so this really only checks that the PR doesn't break 1D reductions.

In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions:

- `test_2d_reduction_odd_shapes`: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions.
-  `test_2d_reduce_no_x_dim`: test 2D reductions with no x dimension.
-  `test_2d_welford_reduction`: test 2D welford reductions with block pointers.
- `test_welford_non_block_pointer`: test a 2D welford reduction when block pointer analysis fails.
- `test_reduction_multiple_discontiguous_dims`: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D.
- `test_2d_reduction_multi_kernel`: test multi kernel autotuning on a 2D softmax kernel.
- `test_enable_tiled_reductions`: test that `config.triton.tile_reductions` enables/disables this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137243
Approved by: https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-31 05:06:46 +00:00
f3e5078c27 [Inductor] Relax size constraints for re-inplacing (#143884)
Current reinplacing requires that the input buffer and output buffer have exactly the same storage size. However, matmul padding may increase the tensor size slightly for better performance, which prevents reinplacing.

This PR changes the size constraints to be (see the sketch after this list):
- input and output buffers have exactly the same symbolic expression for storage size (i.e., sympy str).
- it's statically known that 0.99 * input_size <= output_size <= input_size
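
A hypothetical helper expressing the relaxed check, treating the two bullets as alternative acceptance conditions (an assumption); the names are illustrative and this is not the PR's implementation:

```python
def can_reinplace(input_size_expr, output_size_expr, input_size_hint, output_size_hint):
    # Exact symbolic match on the storage-size expressions (compared as sympy strings).
    if str(input_size_expr) == str(output_size_expr):
        return True
    # Otherwise require that the output is statically known to be at most the input
    # and at least 99% of it (matmul padding only grows the buffer slightly).
    return 0.99 * input_size_hint <= output_size_hint <= input_size_hint
```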

### Apply on llm.c
See the reuse of `buf1`.
Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078))
![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9)

After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053))
![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884
Approved by: https://github.com/eellison
2024-12-31 03:52:47 +00:00
cyy
8df99b6a6e Remove unneeded std::make_optional (#143575)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143575
Approved by: https://github.com/Skylion007
2024-12-31 03:08:47 +00:00
11bb94b7ea [MPSInductor] Fix index generation for transpose (#143973)
Alas, PythonPrinter would not work here, nor would CppPrinter, so start building MetalPrinter.

`pytest test/inductor/test_torchinductor.py -k _mps` score is 474 failed, 277 passed, 32 skipped
Before this change:
`pytest test/inductor/test_torchinductor.py -k _mps` reported 506 failed, 245 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143973
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949
2024-12-31 02:04:50 +00:00
cb24013b5b Fix assertion failure in pytorch profiler (#143940)
Summary:
Attempt to fix the following exception which occurred when profiling a Pytorch model ( Meta-internal LLM ) that also involved a ThreadPoolExecutor in the background:
```
Exception Found: !stack.empty() INTERNAL ASSERT FAILED at "fbcode/caffe2/torch/csrc/autograd/profiler_python.cpp":987, please report a bug to PyTorch. Python replay stack is empty.
```
The root cause of this issue seems to be that a thread's call stack can be empty, while the code asserts that it is not empty.

I fixed this with some minimal changes to profiler_python.cpp

Approach:
 * Ensuring that the stack in question is not empty before trying to pop from it.

Test Plan:
* Tested manually on a reproducible scenario where the assertion failure was otherwise triggered ( repro too large to include here ). The assertion failure disappears.
 * CI

Differential Revision: D67691558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143940
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
2024-12-31 01:43:04 +00:00
cyy
af629a8146 Enable readability-redundant-declaration (#143982)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143982
Approved by: https://github.com/Skylion007
2024-12-31 00:20:10 +00:00
934eaa503f [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR adds functional support for max-autotune on XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-30 23:51:17 +00:00
d9a6ffb63c [FSDP] Add workaround to fix buffer_dtype without root parameters (#143989)
Fixes https://github.com/pytorch/pytorch/issues/143900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143989
Approved by: https://github.com/H-Huang
2024-12-30 23:42:24 +00:00
2da7fb5320 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not produce cache hits for each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-30 23:35:11 +00:00
d88a8c41d5 Fix flaky "Upload test stats" job (#143991)
Test stat uploading was intermittently failing due to certain XML strings being opportunistically converted to numbers, when string output was expected. This PR makes the conversion behavior optional, which should fix the stat uploads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143991
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-12-30 21:40:01 +00:00
d260bc4476 cpp_wrapper: minimize pybind11 dependency (#143772)
Only include the parts of `pybind11` that handle GIL management within `cpp_wrapper`. This dramatically improves compilation times by reducing the number of headers we compile. Improvements on my local system are on the order of 2x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143772
Approved by: https://github.com/Skylion007
2024-12-30 20:41:02 +00:00
baee623691 [BE][Ez]: Update fmtlib submodule to 1.11.1 (#143937)
* Exactly the same as the previous fmtlib, except it fixes an edge case that could affect ABI compatibility between fmtlib versions.
* Seems safe to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143937
Approved by: https://github.com/albanD
2024-12-30 19:46:27 +00:00
9d026000de change import relative paths due to internal build failures (#143968)
Internal builds failing due to #143355, changing imports to be relative, similar to other imports

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143968
Approved by: https://github.com/albanD
2024-12-30 17:19:49 +00:00
c27c788e35 [MPS] Fix torch.add(x,y, alpha=2) crash (#143949)
TODO: in a follow-up PR, replace this weird logic with shaders

Fixes https://github.com/pytorch/pytorch/issues/143932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143949
Approved by: https://github.com/Skylion007
ghstack dependencies: #143948
2024-12-30 17:16:29 +00:00
beb6c2dea5 [MPS] Fix crash when mm is invoked with mixed dtypes (#143948)
Simply by copy-n-pasting check from
a7915c56f6/aten/src/ATen/native/cuda/Blas.cpp (L254-L257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143948
Approved by: https://github.com/Skylion007
2024-12-30 17:13:34 +00:00
7c1c0730be [Inductor UT] Generalize newly introduced device-bias hard code in (#143975)
test_pattern_matcher.py
Fix #143974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143975
Approved by: https://github.com/malfet
2024-12-30 16:47:19 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They are helpful to simplifying code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
438698b20b [CD] Remove redundant triton dependency for xpu wheels (#143839)
Because XPU CD wheels enabled PyPI dependencies in https://github.com/pytorch/pytorch/pull/141135, PYTORCH_EXTRA_INSTALL_REQUIREMENTS now has a value for the XPU CD wheel build.
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Fixes #143838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143839
Approved by: https://github.com/huydhn
2024-12-30 13:39:06 +00:00
2fa09853cb Update slow tests (#143745)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143745
Approved by: https://github.com/pytorchbot
2024-12-30 11:51:49 +00:00
2ed4d65af0 Update torch-xpu-ops commit pin (#143853)
Update the torch-xpu-ops commit to [214f33](214f33b9d9), which includes:

- Fix building issue for transformer related operators
- Improve XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143853
Approved by: https://github.com/EikanWang
2024-12-30 02:38:16 +00:00
1b0d19a2cb Revert "[inductor] Make generated kernels deterministic (#143951)"
This reverts commit 79b354ee37b7d8a06a48ca8cc4e19a3fd006b433.

Reverted https://github.com/pytorch/pytorch/pull/143951 on behalf of https://github.com/wdvr due to failing tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/143951#issuecomment-2564952267))
2024-12-30 02:06:38 +00:00
cf89127137 [Torch.package] Add support for UntypedStorage tensors (#143930)
Summary: fp8 uses untyped storage. Add support for torch.package by using the same logic as in serialization.py

Differential Revision: D67684033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143930
Approved by: https://github.com/henrylhtsang
2024-12-30 02:03:52 +00:00
92d8965082 Adding support for differentiable lr, weight_decay, and betas in Adam/AdamW (#143726)
Third PR in a series of PRs to broaden differentiable optimizer support w/ @janeyx99 (sorry for pinging over the holidays! I just wanted to put this one out but I am definitely not asking for review or anything like that rn)

This is also going to probably be my last PR before the holidays!

Note: This is a branch of #143710 -- I've never worked on a branch of a branch before so I wasn't sure about the protocol so I thought I'd just made the PR and wait until that one gets merged.

This is adding support for differentiable lr, weight_decay, and betas to Adam and AdamW (but after refactoring AdamW into an Adam subclass, it's really just changing code in torch/optim/adam.py)

I had one main thing I was wondering about, which is that adam already has a differentiable flag built in, so I have code like this
```py
if differentiable and isinstance(beta2, Tensor):
    if beta2.requires_grad:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
    else:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
That I could definitely simplify to just
```py
if differentiable and isinstance(beta2, Tensor):
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```

It would definitely be a little slower in the case that it's differentiable but doesn't need a grad for beta2, but the code would also be a lot more clear and I'm debating speed vs future code usability.

Also the line in the above example:
```py
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
```
was concerning to me because it is considerably more expensive than `value=1 - beta2`, but I couldn't think of a better way to do it.

Further work on #141832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143726
Approved by: https://github.com/janeyx99
2024-12-30 01:11:57 +00:00
a7915c56f6 Propagate callable parameter types using ParamSpec (#142306) (#143797)
The codebase has a few locations where callable parameter type information is lost when the unpackings *args and **kwargs are typed as Any. Refactor these instances to retain type information using typing_extensions.ParamSpec.

Also, in these functions, enforce return type with TypeVar.
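
A generic illustration of the ParamSpec pattern described above (not code from this PR): the wrapper forwards *args/**kwargs without collapsing their types to Any.

```python
from typing import Callable, TypeVar
from typing_extensions import ParamSpec

_P = ParamSpec("_P")
_T = TypeVar("_T")

def with_logging(fn: Callable[_P, _T]) -> Callable[_P, _T]:
    def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> _T:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@with_logging
def scale(x: float, factor: float = 2.0) -> float:
    return x * factor

scale(3.0)  # type checkers still see (x: float, factor: float = 2.0) -> float
```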

Addresses #142306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143797
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2024-12-29 23:03:14 +00:00
79b354ee37 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not produce cache hits for each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-29 19:53:33 +00:00
b6bdb67f82 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes, in apply order (a small before/after sketch follows the list):

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
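
A small before/after sketch of these rewrites, assuming the current file is nested at least three directories below the filesystem root; illustrative only, not code from the PR:

```python
import os
from pathlib import Path

# Before: chained dirname calls walking up from the current file.
root_old = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

# After: resolve the path first, then walk up with .parents[N - 1].
root_new = str(Path(__file__).absolute().parents[2])

# Both point at the same directory for a normally-nested source file.
assert root_old == root_new
```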

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-29 17:23:13 +00:00
7101b8ca35 remove allow-untyped-defs from onnx/_internal/_lazy_import.py (#143943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143943
Approved by: https://github.com/justinchuby
2024-12-29 10:29:43 +00:00
cf0b72c4ab remove allow-untyped-defs from _inductor/compile_worker/watchdog.py (#143941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143941
Approved by: https://github.com/Skylion007
2024-12-29 01:05:09 +00:00
3ba6fcd3ff remove allow-untyped-defs from torch/_size_docs.py (#143942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143942
Approved by: https://github.com/Skylion007
2024-12-29 01:00:46 +00:00
85f348578b [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#143929)
Differential Revision: D67682313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143929
Approved by: https://github.com/hl475
2024-12-28 23:39:21 +00:00
e1abbe155e remove allow-untyped-defs from ao/nn/qat/dynamic/modules/linear.py (#143919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143919
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-28 20:50:48 +00:00
3054aae493 [MPS] Fix fmin/fmax for scalar argument (#143934)
CPU scalar promotion to GPU is allowed for CUDA and should be allowed for MPS as well (at the very least it should not crash)

Fixes https://github.com/pytorch/pytorch/issues/143933 https://github.com/pytorch/pytorch/issues/142203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143934
Approved by: https://github.com/Skylion007
2024-12-28 17:07:19 +00:00
45a709d9ec Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit cbc4cf3043a7316c1f6e86b1e22d96042a59489c.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
8cccc46e33 Revert "Add AOT inductor support for _scaled_mm for CPU (#141961)"
This reverts commit 3fabd10c40c632104e420ae8e3721f33176e8640.

Reverted https://github.com/pytorch/pytorch/pull/141961 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
d8c3900d80 [Inductor] Implement primitive Metal compiler (#143893)
Still work in progress, only works for element wise operations. Current implementation could be used to turn something like
```python
def f(x):
  return x[:,::2].sin() + x[:, 1::2].cos()
```
into the following shader
```python
# Topologically Sorted Source Nodes: [sin, cos, add], Original ATen: [aten.sin, aten.cos, aten.add]
# Source node to ATen node mapping:
#   add => add
#   cos => cos
#   sin => sin
# Graph fragment:
#   %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%slice_2,), kwargs = {})
#   %cos : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%slice_4,), kwargs = {})
#   %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%sin, %cos), kwargs = {})
mps_lib = torch.mps._compile_shader("""
    kernel void kernel_0(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[2*x0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp2 = in_ptr0[2*x0 + 1];
        auto tmp3 = metal::precise::cos(tmp2);
        auto tmp4 = tmp1 + tmp3;
        out_ptr0[x0] = static_cast<float>(tmp4);
    }
""")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143893
Approved by: https://github.com/jansel
ghstack dependencies: #143891, #143892
2024-12-28 06:58:32 +00:00
74028cfd0c [Inductor][CPP] Fix Data Type issue of frexp (#143746)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensors with different data types, and the current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly because the output index is missing. In this PR, we set the data type of the cse var in the codegen of `frexp` and avoid it being overridden in the following flow.
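
For reference, a small example of the two differently-typed outputs of `torch.frexp`:

```python
import torch

x = torch.tensor([0.5, 4.0, 24.0])
mantissa, exponent = torch.frexp(x)
print(mantissa.dtype)  # torch.float32
print(exponent.dtype)  # torch.int32
# x == mantissa * 2 ** exponent, with 0.5 <= |mantissa| < 1 for nonzero inputs
```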

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
2024-12-28 06:00:13 +00:00
01980cac38 [dynamo] Make ConstDictKeySource a subclass of ChainedSource (#143924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143924
Approved by: https://github.com/jansel
2024-12-28 05:59:45 +00:00
3fabd10c40 Add AOT inductor support for _scaled_mm for CPU (#141961)
This PR is to add AOT inductor support for _scaled_mm for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141961
Approved by: https://github.com/malfet
ghstack dependencies: #139975
2024-12-28 05:57:35 +00:00
cbc4cf3043 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.
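
A minimal sketch, assuming a build where this CPU FP8 path is available; the scales, layouts, and out_dtype below mirror typical `_scaled_mm` usage and are not taken from the PR's tests:

```python
import torch

a = torch.randn(16, 32).to(torch.float8_e4m3fn)
b = torch.randn(32, 64).to(torch.float8_e4m3fn).t().contiguous().t()  # column-major layout

scale_a = torch.tensor(1.0)  # per-tensor scales
scale_b = torch.tensor(1.0)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape, out.dtype)  # torch.Size([16, 64]) torch.bfloat16
```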

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2024-12-28 05:49:06 +00:00
d3e9133ab2 Fix separate in process bisector cache, cleanup on exit (#143661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143661
Approved by: https://github.com/ezyang
ghstack dependencies: #143657
2024-12-28 03:20:37 +00:00
1e246ef05b [CUDA][CUDA graphs][RNG] Skip replay prologue if wholegraph_increment is 0 (#143777)
#143572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143777
Approved by: https://github.com/ngimel, https://github.com/eellison
2024-12-28 02:31:26 +00:00
4a7cf0dbff [Inductor] Add MPS device op overrides (#143892)
Mostly a dummy interface, as the MPS backend is currently limited to a single device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143892
Approved by: https://github.com/jansel
ghstack dependencies: #143891
2024-12-28 02:11:45 +00:00
ad78edee8e Add support for list, tuple and dict in numeric debugger (#143882)
Summary:
Previously the numeric debugger only supported torch.Tensor; this PR adds support for list, tuple, and dict as well.

Test Plan:
python test/test_quantization.py -k test_extract_results_from_loggers_list_output

Differential Revision: [D67660049](https://our.internmc.facebook.com/intern/diff/D67660049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143882
Approved by: https://github.com/dulinriley
2024-12-28 02:10:31 +00:00
c3c27aef34 [dynamo] Remove HFPretrained config hack (#143698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143698
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143888
2024-12-28 02:03:13 +00:00
7c343a9d68 Fix emulate low precision bool inp (#143657)
Fix for https://github.com/pytorch/pytorch/issues/143502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143657
Approved by: https://github.com/BoyuanFeng
2024-12-28 01:51:28 +00:00
88ccf2fa5e remove allow-untyped-defs from distributed/elastic/multiprocessing/subprocess_handler/handlers.py (#143917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143917
Approved by: https://github.com/Skylion007
2024-12-28 00:13:05 +00:00
e3fefdfbf0 [CUTLASS] fix addmm (#143537)
We would previously get a CUDA IMA (illegal memory access) because Bias was passed in for X, so we need to re-order the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143537
Approved by: https://github.com/chenyang78
ghstack dependencies: #143528
2024-12-27 23:47:55 +00:00
b54620f40f [CUTLASS] fix bugs: extra data_ptr() call, wrong size symbol name, bias symbol not added (#143528)
A few small things in this PR:
- fixed a bug where `workspace.data_ptr().data_ptr()` showed up
- for SM80 CUTLASS kernels, the symbol size for W.size(1) was never created
- for addmm kernels, the ldc bias symbol never showed up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143528
Approved by: https://github.com/henrylhtsang
2024-12-27 23:38:18 +00:00
c17d767686 remove allow-untyped-defs from _inductor/codegen/rocm/rocm_template_buffer.py (#143870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143870
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-12-27 23:28:51 +00:00
63d6e1f743 remove allow-untyped-defs from _inductor/codegen/aoti_hipify_utils.py (#143916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143916
Approved by: https://github.com/Skylion007
2024-12-27 23:25:37 +00:00
928e01545c restore 'unused' variable to fix test_cuda_device_memory_allocated (#143885)
This PR fixes `test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_device_memory_allocated`
by restoring a deleted 'unused' variable from commit d8c8ba2440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143885
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-27 23:18:13 +00:00
0de661dc27 Add support for differentiable weight decay (#143679)
(Actual) second PR in a larger project to broaden support for differentiable optimizers with @janeyx99!

In this PR, I did a lot of pattern matching from the previous PR to add support for differentiable weight_decay.

And also added a single new line on line 359 (previously line 352) to make the code from the last PR a little easier to read

Continuation of progress on #141832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143679
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-12-27 23:14:43 +00:00
c0c7f881da remove allow-untyped-defs from distributed/pipelining/_unflatten.py (#143915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143915
Approved by: https://github.com/aorenste, https://github.com/Skylion007, https://github.com/malfet
2024-12-27 22:21:28 +00:00
af823bd526 remove allow-untyped-defs from utils/tensorboard/_convert_np.py (#143918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143918
Approved by: https://github.com/Skylion007
2024-12-27 22:19:33 +00:00
fe398de769 [EZ] Update sympy to 1.13.3 (#143908)
And remove the python>=3.9 check, as it now covers all supported Python versions

Fixes https://github.com/pytorch/pytorch/issues/143907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143908
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-12-27 21:32:55 +00:00
b5042cfa58 Revert "remove allow-untyped-defs from torch/ao/__init__.py (#143604)"
This reverts commit 1598d458797e69376a9a148bd37fb6e8167d22e3.

Reverted https://github.com/pytorch/pytorch/pull/143604 on behalf of https://github.com/wdvr due to failing typing checks in torchao ([comment](https://github.com/pytorch/pytorch/pull/143604#issuecomment-2564043233))
2024-12-27 21:30:02 +00:00
7a13bfa1ad [EZ] Update jinja2 to 3.1.5 (#143923)
To make Dependabot happy about https://cwe.mitre.org/data/definitions/150.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143923
Approved by: https://github.com/Skylion007
2024-12-27 21:10:21 +00:00
228b228449 Fix batch-specific attention mod for NJT + Flex (#143866)
Fixes #143788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143866
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-12-27 20:51:41 +00:00
1e65dec2b9 [Dynamo] Add MPSDevice interface (#143891)
That simply checks whether the device is available and whether it supports bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143891
Approved by: https://github.com/jansel
2024-12-27 20:31:44 +00:00
d2f769476f [Easy] add quotes to shell activation commands (#143902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143902
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-27 19:17:46 +00:00
a87cd5283b [dynamo] Trace through overridden __getattribute__ method (#143888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143888
Approved by: https://github.com/jansel
2024-12-27 18:10:00 +00:00
fda9048ca8 remove allow-untyped-defs from distributed/elastic/multiprocessing/errors/handlers.py (#143869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143869
Approved by: https://github.com/Skylion007
2024-12-27 15:49:19 +00:00
a20765a9c1 subgraph rewriter supports matched pattern with no users (#143842)
Fixes #143841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143842
Approved by: https://github.com/yushangdi
2024-12-27 12:45:39 +00:00
9e8d84f863 Fix duplicate pattern error (#139321)
vllm hit an error where we incorrectly stated that two patterns are duplicates. See the comment inline:

For a particular generated pattern repr, store all the equivalent graphs that used to generate them. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish if two equivalent searches are actually different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321
Approved by: https://github.com/shunting314
2024-12-27 11:10:46 +00:00
3571476739 Revert "fix randint distribution for large max (#143787)"
This reverts commit 8059d56ec364feb554f3fb90012a0fc2d2104e7f.

Reverted https://github.com/pytorch/pytorch/pull/143787 on behalf of https://github.com/wdvr due to failing internal tests, to be fixed first ([comment](https://github.com/pytorch/pytorch/pull/143787#issuecomment-2563493323))
2024-12-27 09:16:36 +00:00
f6801ba4b3 Revert "Use random64 in Fischer-Yates algorithm for large N (#143682)"
This reverts commit 7013be0094e8d3ded2ba2f948082f98d63e622bb.

Reverted https://github.com/pytorch/pytorch/pull/143682 on behalf of https://github.com/wdvr due to failing Meta internal tests that need to be updated ([comment](https://github.com/pytorch/pytorch/pull/143682#issuecomment-2563487675))
2024-12-27 09:09:33 +00:00
ba5cacbc17 [Codemod][AddExplicitStrictExportArg] caffe2/test (#143688)
Reviewed By: avikchaudhuri

Differential Revision: D67530154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143688
Approved by: https://github.com/tugsbayasgalan
2024-12-27 07:58:44 +00:00
969415885d [inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373
Approved by: https://github.com/eellison
2024-12-27 06:46:09 +00:00
cyy
379bbef23c Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-27 05:46:57 +00:00
fca457b5db Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 3f80632c802f1d9fafd0c303d45ba2376b9c1e11.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2563331259))
2024-12-27 05:25:17 +00:00
0f474a960b [dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #143722
2024-12-27 04:51:35 +00:00
e296bab614 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)` (see the sketch below). This ensures that we do not call an overridden keys method, so the C++ guard can use `PyDict_Next` directly to check the guards.
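
A plain-Python illustration of why `dict.keys(d)` is used here: it bypasses an overridden `keys()` and reflects the real underlying dict contents.

```python
class NoisyDict(dict):
    def keys(self):
        return ["not", "the", "real", "keys"]

d = NoisyDict(a=1, b=2)
print(list(d.keys()))      # ['not', 'the', 'real', 'keys']  (overridden method)
print(list(dict.keys(d)))  # ['a', 'b']                      (base dict behavior)
```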

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-27 04:51:35 +00:00
d60282c177 remove allow-untyped-defs from _inductor/codegen/cpu_device_op_overrides.py (#143881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143881
Approved by: https://github.com/aorenste
2024-12-27 04:10:47 +00:00
43853691bc [Quantization] add an option keep_original_weights in _lower_to_native_backend (#141049)
Differential Revision: D66153809

This diff adds an option to keep_original_weights so we can track back the original weight and bias after performing prepare_fx and convert_fx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141049
Approved by: https://github.com/jerryzh168
2024-12-27 04:02:07 +00:00
809106a93f [fr][c10d] fix flaky test (#143878)
Summary:
The test erroneously assumed that input/output sizes are the same and that all
states are matchable.

Fixes issue #143798

Test Plan:
Test passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143878
Approved by: https://github.com/fduwjj
ghstack dependencies: #143865
2024-12-27 03:13:15 +00:00
1cd70e7e23 [fr][c10d] log trace capture enabled or not in flight recorder (#143865)
Summary:
Refactor logging for flight recorder so we can log whether stack trace capture was enabled.
We introduce a new column ('trace_enabled') in the logger.

Test Plan:
Tested on local job and noted that correct output was produced.
Internal link: https://fburl.com/scuba/c10d_flight_recorder/ulhqnmhg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143865
Approved by: https://github.com/fduwjj
2024-12-27 03:07:55 +00:00
6bdf2addc5 [inductor] Simplify get_launch_args_* handling (#143835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143835
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #143813, #143814, #143815, #143817, #143818
2024-12-27 02:02:11 +00:00
138efb3002 [inductor] Move GPUTarget backwards compat to triton_compat.py (#143818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143818
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814, #143815, #143817
2024-12-27 02:02:11 +00:00
be1936804b [inductor] Drop support for pre-ASTSource Triton (#143817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143817
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814, #143815
2024-12-27 02:02:11 +00:00
f3d0f67039 [inductor] Minor refactor of hip compile_meta (#143815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143815
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814
2024-12-27 02:02:11 +00:00
29841b9414 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143871
Approved by: https://github.com/Skylion007
2024-12-27 01:20:26 +00:00
373dba35f9 remove allow-untyped-defs from fx/experimental/refinement_types.py (#143868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143868
Approved by: https://github.com/Skylion007
2024-12-27 01:00:45 +00:00
c4bff71854 [Easy] Add ROCm support to nightly pull tool (#141282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141282
Approved by: https://github.com/malfet
ghstack dependencies: #143263
2024-12-27 00:07:38 +00:00
8059d56ec3 fix randint distribution for large max (#143787)
Fixes #ISSUE_NUMBER
Similar to #143682, for large maximum values we were sampling integers via `%`, which doesn't produce a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`).
This comes with a significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.
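A small, self-contained sketch of the modulo-bias problem described above (plain Python, not the actual kernel; the `n/2` cut-off is just for illustration):

```python
import random

n = (2 * 2**32) // 3                          # a "large max" that does not divide 2**32
samples = [random.getrandbits(32) % n for _ in range(1_000_000)]
frac_low = sum(s < n // 2 for s in samples) / len(samples)
print(f"fraction below n/2: {frac_low:.3f}")  # ~0.67 instead of the expected 0.5
```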

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
2024-12-26 23:54:03 +00:00
1598d45879 remove allow-untyped-defs from torch/ao/__init__.py (#143604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143604
Approved by: https://github.com/aorenste
2024-12-26 23:27:16 +00:00
3f80632c80 Add torch._scaled_mm for CPU (#139975)
This PR adds `torch._scaled_mm` for the CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests.
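A rough usage sketch (this is a private API; the exact argument names, layout requirements, and dtype support are assumptions here and may differ across versions):

```python
import torch

a = torch.randn(16, 32).to(torch.float8_e4m3fn)
b = torch.randn(32, 16).to(torch.float8_e4m3fn).t().contiguous().t()  # column-major mat2
out = torch._scaled_mm(
    a, b,
    scale_a=torch.tensor(1.0),   # per-tensor scales
    scale_b=torch.tensor(1.0),
    out_dtype=torch.bfloat16,
)
print(out.shape)  # torch.Size([16, 16])
```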

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139974
2024-12-26 22:22:42 +00:00
26364428f5 Revert "[dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)"
This reverts commit fe95cbe018218d159ba0a0269045b31ff72f1a20.

Reverted https://github.com/pytorch/pytorch/pull/143722 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))
2024-12-26 22:04:36 +00:00
ee25daef5a Revert "[dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)"
This reverts commit 7d1c6661397f9bff93c1ea389506c8a163d7a2ab.

Reverted https://github.com/pytorch/pytorch/pull/143699 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))
2024-12-26 22:04:35 +00:00
2966fb3708 [pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143775)
The resources directory lets the ET observer dump additional data, such as Triton kernels, while capturing the ET.

This allows us to use the ET trace to replay PT2 workloads and get visibility into data such as generated kernels and their usage in a model, index tensor data, etc.

We also added a few ways to enable ET and ET resources through OS environment variables.

Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in PyTorch.

Additionally, setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run, such as Triton kernels.
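A hypothetical usage sketch (the variable names come from the description above; how and when they are read is an assumption here):

```python
import os

# Must be set before the profiler/observer reads them, e.g. at process start.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE"] = "1"
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS"] = "1"

# ... then run the torch.compile'd workload as usual and collect the trace.
```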

Differential Revision: [D67610542](https://our.internmc.facebook.com/intern/diff/D67610542/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67610542/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143775
Approved by: https://github.com/shengfukevin, https://github.com/wdvr
2024-12-26 21:15:39 +00:00
96e9a5aeec [CI] Disable sccache for xpu test (#143851)
WA for https://github.com/pytorch/pytorch/issues/143585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143851
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-26 19:45:04 +00:00
3df12d38cf dynamo tracing perf: cache cleaned_instructions: 33.7 -> 30.0 (#143070)
See #143056 for overall docs.

This PR: Cache the interesting/expensive bits of `cleaned_instructions()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143070
Approved by: https://github.com/jansel
2024-12-26 19:02:08 +00:00
51a7ecde80 [Easy] Bump CUDA nightly version to 11.8 / 12.4 / 12.6 in nightly pull tool (#143263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143263
Approved by: https://github.com/malfet
2024-12-26 19:01:38 +00:00
78502a58ba Enable FSDP2 on XPU device (#143737)
**Motivation:** Enable FSDP2 on XPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143737
Approved by: https://github.com/awgu
2024-12-26 18:34:11 +00:00
475656fd9c Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 2293fe1024812d6349f6e2b3b7de82c6b73f11e4.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/malfet due to failing internal ROCM builds with error: ModuleNotFoundError: No module named hipify ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2562973920))
2024-12-26 17:32:23 +00:00
cc4e70b7c3 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 135c7db99d646b8bd9603bf969d47d3dec5987b1.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/malfet due to need to revert to as dependency of https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2562969825))
2024-12-26 17:26:06 +00:00
9255ffc841 Revert "Enable more C++ warnings (#143355)"
This reverts commit daa3ffe0ebff58577b8db964447b6abc6de53a25.

Reverted https://github.com/pytorch/pytorch/pull/143355 on behalf of https://github.com/malfet due to It fails internal build system as it kind of breaks separation between native and native/cpu ([comment](https://github.com/pytorch/pytorch/pull/143355#issuecomment-2562961546))
2024-12-26 17:13:10 +00:00
cf76c05b4d [inductor] Refactor conditional triton imports into triton_compat.py (#143814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143814
Approved by: https://github.com/Skylion007
ghstack dependencies: #143813
2024-12-26 09:14:06 +00:00
efac5ed81b [inductor] Reorder imports in codecache.py (#143813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143813
Approved by: https://github.com/Skylion007
2024-12-26 09:14:06 +00:00
bf8da4c145 Bump jinja2 from 3.1.4 to 3.1.5 in /.ci/docker (#143844)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/pallets/jinja/releases">jinja2's releases</a>.</em></p>
<blockquote>
<h2>3.1.5</h2>
<p>This is the Jinja 3.1.5 security fix release, which fixes security issues and bugs but does not otherwise change behavior and should not result in breaking changes compared to the latest feature release.</p>
<p>PyPI: <a href="https://pypi.org/project/Jinja2/3.1.5/">https://pypi.org/project/Jinja2/3.1.5/</a>
Changes: <a href="https://jinja.palletsprojects.com/changes/#version-3-1-5">https://jinja.palletsprojects.com/changes/#version-3-1-5</a>
Milestone: <a href="https://github.com/pallets/jinja/milestone/16?closed=1">https://github.com/pallets/jinja/milestone/16?closed=1</a></p>
<ul>
<li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as by passing a stored reference to a filter that calls its argument. <a href="https://github.com/pallets/jinja/security/advisories/GHSA-q2x7-8rv6-6q7h">GHSA-q2x7-8rv6-6q7h</a></li>
<li>Escape template name before formatting it into error messages, to avoid issues with names that contain f-string syntax. <a href="https://redirect.github.com/pallets/jinja/issues/1792">#1792</a>, <a href="https://github.com/pallets/jinja/security/advisories/GHSA-gmj6-6f8f-6699">GHSA-gmj6-6f8f-6699</a></li>
<li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence types. <a href="https://redirect.github.com/pallets/jinja/issues/2032">#2032</a></li>
<li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1952">#1952</a></li>
<li>Avoid unclosed <code>auto_aiter</code> warnings. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving <code>root_render_func()</code> unclosed in <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving async generators unclosed in blocks, includes and extends. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>The runtime uses the correct <code>concat</code> function for the current environment when calling block references. <a href="https://redirect.github.com/pallets/jinja/issues/1701">#1701</a></li>
<li>Make <code>|unique</code> async-aware, allowing it to be used after another async-aware filter. <a href="https://redirect.github.com/pallets/jinja/issues/1781">#1781</a></li>
<li><code>|int</code> filter handles <code>OverflowError</code> from scientific notation. <a href="https://redirect.github.com/pallets/jinja/issues/1921">#1921</a></li>
<li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code> call. <a href="https://redirect.github.com/pallets/jinja/issues/2021">#2021</a></li>
<li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code> objects. <a href="https://redirect.github.com/pallets/jinja/issues/2025">#2025</a></li>
<li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object. <a href="https://redirect.github.com/pallets/jinja/issues/2027">#2027</a></li>
<li><code>Environment.overlay(enable_async)</code> is applied correctly. <a href="https://redirect.github.com/pallets/jinja/issues/2061">#2061</a></li>
<li>The error message from <code>FileSystemLoader</code> includes the paths that were searched. <a href="https://redirect.github.com/pallets/jinja/issues/1661">#1661</a></li>
<li><code>PackageLoader</code> shows a clearer error message when the package does not contain the templates directory. <a href="https://redirect.github.com/pallets/jinja/issues/1705">#1705</a></li>
<li>Improve annotations for methods returning copies. <a href="https://redirect.github.com/pallets/jinja/issues/1880">#1880</a></li>
<li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1870">#1870</a></li>
<li>Tests decorated with <code>@pass_context</code> can be used with the <code>|select</code> filter. <a href="https://redirect.github.com/pallets/jinja/issues/1624">#1624</a></li>
<li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the target is a namespace attribute. <a href="https://redirect.github.com/pallets/jinja/issues/1413">#1413</a></li>
<li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks does not cause the variable to be considered initially undefined. <a href="https://redirect.github.com/pallets/jinja/issues/1253">#1253</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="877f6e51be"><code>877f6e5</code></a> release version 3.1.5</li>
<li><a href="8d58859265"><code>8d58859</code></a> remove test pypi</li>
<li><a href="eda8fe86fd"><code>eda8fe8</code></a> update dev dependencies</li>
<li><a href="c8fdce1e03"><code>c8fdce1</code></a> Fix bug involving calling set on a template parameter within all branches of ...</li>
<li><a href="66587ce989"><code>66587ce</code></a> Fix bug where set would sometimes fail within if</li>
<li><a href="fbc3a696c7"><code>fbc3a69</code></a> Add support for namespaces in tuple parsing (<a href="https://redirect.github.com/pallets/jinja/issues/1664">#1664</a>)</li>
<li><a href="b8f4831d41"><code>b8f4831</code></a> more comments about nsref assignment</li>
<li><a href="ee832194cd"><code>ee83219</code></a> Add support for namespaces in tuple assignment</li>
<li><a href="1d55cddbb2"><code>1d55cdd</code></a> Triple quotes in docs (<a href="https://redirect.github.com/pallets/jinja/issues/2064">#2064</a>)</li>
<li><a href="8a8eafc6b9"><code>8a8eafc</code></a> edit block assignment section</li>
<li>Additional commits viewable in <a href="https://github.com/pallets/jinja/compare/3.1.4...3.1.5">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=jinja2&package-manager=pip&previous-version=3.1.4&new-version=3.1.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143844
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-26 05:20:06 +00:00
cyy
e05bfb8ee3 [Submodule] Bump libfmt to 11.1.0 (#143843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143843
Approved by: https://github.com/Skylion007
2024-12-26 04:49:11 +00:00
4bacfd6e11 Sort requirements.txt (#143778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143778
Approved by: https://github.com/albanD
2024-12-26 00:51:52 +00:00
cyy
f42cff4e29 [17/N] Fix extra warnings brought by clang-tidy-17 (#143804)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143804
Approved by: https://github.com/Skylion007
2024-12-25 19:54:42 +00:00
a8ac3a6b20 [inductor] fix the adaptive_avg_pool on processing int64 (#143802)
Fixes #143801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143802
Approved by: https://github.com/jansel
2024-12-25 09:08:43 +00:00
c0d710634f Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292)
Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318
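A hypothetical sketch of the behavior being fixed (assumes a ROCm build with more than one GPU; the env var must be set before the HIP runtime initializes):

```python
import os

os.environ["ROCR_VISIBLE_DEVICES"] = "0"   # restrict discovery to device 0

import torch

print(torch.cuda.device_count())           # expected to report only the visible device
```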

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-25 02:37:11 +00:00
7013be0094 Use random64 in Fischer-Yates algorithm for large N (#143682)
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html
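An illustrative sketch of the issue (plain Python, not the ATen implementation): Fisher-Yates needs an unbiased index `j` in `[0, i]` at each step; deriving it from a 32-bit random value via `%` is noticeably biased for very large `N` and cannot even reach indices above `2**32 - 1`, hence the switch to a 64-bit source.

```python
import random

def fisher_yates(n: int, rand_bits: int = 64) -> list[int]:
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        r = random.getrandbits(rand_bits)
        j = r % (i + 1)   # biased whenever (i + 1) does not divide 2**rand_bits
        perm[i], perm[j] = perm[j], perm[i]
    return perm

print(fisher_yates(10))   # for small n the bias is negligible either way
```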

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682
Approved by: https://github.com/eqy, https://github.com/albanD
2024-12-25 01:19:19 +00:00
27b0d41f0a [ROCm] Add miopen_batch_norm to meta_registrations to fix AOTI issue (#143569)
Currently the upstream example for AOTI usage breaks on ROCm (https://pytorch.org/tutorials/recipes/torch_export_aoti_python.html)

```
File "/root/upstream/torch/_dynamo/exc.py", line 317, in unimplemented
    raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: unsupported operator: aten.miopen_batch_norm.default (see https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.64r4npvq0w0 for how to fix)

from user code:
   File "/root/vision/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/root/vision/torchvision/models/resnet.py", line 269, in _forward_impl
    x = self.bn1(x)
```

This PR adds a meta_registration for miopen_batch_norm to resolve this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143569
Approved by: https://github.com/jeffdaily
2024-12-24 23:43:11 +00:00
9035fb5a7b [dynamo] Add types to exc.py (#143626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143626
Approved by: https://github.com/yanboliang
ghstack dependencies: #143552, #143610
2024-12-24 21:48:32 +00:00
3e7f9e2cc4 [inductor] Shorten tracebacks for errors inside inductor (by skipping AOTAutograd frames) (#143610)
Before #143552
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1381, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1165, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 547, in __call__
    return _compile(
           ^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 987, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 715, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 750, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
    transformations(instructions, code_options)
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 231, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 662, in transform
    tracer.run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
    super().run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
    while self.step():
          ^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
    self._return(inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
    self.output.compile_subgraph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1101, in compile_subgraph
    self.compile_and_call_fx_graph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1382, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1432, in call_user_compiler
    return self._call_user_compiler(gm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1483, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1462, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Before this PR
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1484, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

After this PR
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1138, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1053, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._inductor.exc.InductorError: AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

A large number of frames is removed between:
```py
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143610
Approved by: https://github.com/eellison
ghstack dependencies: #143552
2024-12-24 21:48:32 +00:00
9e5f3fdfc7 [dynamo] Shorten tracebacks for backend compiler errors (#143552)
Fixes #143406

After this PR the error for missing Triton is:
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
    raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

Setting `TORCHDYNAMO_VERBOSE=1` yields something like the old error:
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1383, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1167, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 548, in __call__
    return _compile(
           ^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 988, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 716, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 751, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
    transformations(instructions, code_options)
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 232, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 663, in transform
    tracer.run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
    super().run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
    while self.step():
          ^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
    self._return(inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
    self.output.compile_subgraph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1102, in compile_subgraph
    self.compile_and_call_fx_graph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1383, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1433, in call_user_compiler
    return self._call_user_compiler(gm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1916, in codegen
    self.scheduler.codegen()
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3667, in codegen
    return self._codegen()
           ^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3761, in _codegen
    if device is not None and self.get_backend(device).ready_to_flush():
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3631, in get_backend
    self.backends[device] = self.create_backend(device)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
    raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

This PR also strips dynamo stack frames from other types of backend compile errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143552
Approved by: https://github.com/yanboliang
2024-12-24 21:48:23 +00:00
844e6108f6 Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)"
This reverts commit ad750ae32079020f51f9b7d01237f3ecfa83b6ff.

Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))
2024-12-24 17:22:57 +00:00
6c32ef4c5b Remove builder repo from workflows and scripts (#143776)
Part of https://github.com/pytorch/builder/issues/2054
The builder repo is no longer used. Hence remove any references to the builder repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143776
Approved by: https://github.com/huydhn
2024-12-24 14:11:51 +00:00
aec3b46274 [DTensor] Add aten.amin/amax to linear_reduction_strategy (#143747)
In the same vein as https://github.com/pytorch/pytorch/pull/134206, these two ops still seemed missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143747
Approved by: https://github.com/kwen2501
2024-12-24 13:36:40 +00:00
b77406a9ec [BE][CI] bump ruff to 0.8.4 (#143753)
Changes:

1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-string
3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature.
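Generic illustrations of changes 2 and 3 (not the actual PyTorch diffs):

```python
# 2. %-formatting replaced by an f-string:
name, retries = "inductor", 3
assert "backend %s retried %d times" % (name, retries) == f"backend {name} retried {retries} times"

# 3. A dunder-prefixed parameter replaced by a positional-only parameter:
def clamp_old(__value, lo, hi):
    return max(lo, min(hi, __value))

def clamp_new(value, /, lo, hi):   # `/` makes `value` positional-only
    return max(lo, min(hi, value))

assert clamp_old(5, 0, 3) == clamp_new(5, 0, 3) == 3
```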

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
2024-12-24 12:24:10 +00:00
dbbc81cb34 Enabled force_shape_pad for test_pad_mm and test_slice_mm_bandwidth_computation (#141768)
Some tests fail for the ROCm build on Navi arch because of this check: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L211)

There is no need to determine whether the mm is compute bound for most of the padding tests, since they don't specifically test compute-bound behavior. We don't have enough empirical data to fine-tune this check for AMD GPUs yet. I propose to force the shape padding for the tests that we had trouble with, to avoid this unnecessary logic path.

Please correct me if I didn't add other tests that can potentially fail with this issue or if I added a test that is dependent on logic below the `force_shape_pad` check here: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L444)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141768
Approved by: https://github.com/jeffdaily
2024-12-24 11:03:39 +00:00
783065637e Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-24 10:00:23 +00:00
060ee14753 [inductor] Make adaptive_max_pool2d error on int64 (#143762)
Fixes #143752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143762
Approved by: https://github.com/yanboliang
2024-12-24 08:33:59 +00:00
135c7db99d Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory.
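For reference, the standard-library difference being relied on above (illustrative paths, assuming no symlinks):

```python
from pathlib import Path

p = Path("some/dir/../file.py")
print(p.absolute())  # <cwd>/some/dir/../file.py -- just prepends cwd, no normalization
print(p.resolve())   # <cwd>/some/file.py        -- follows symlinks and collapses '..'
```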

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2024-12-24 08:33:08 +00:00
362ecad9bb [ROCm] Use linux.rocm.gpu.2 for 2-GPU and linux.rocm.gpu.4 for 4-GPU runners (#143769)
* Will enable us to target `periodic`/distributed CI jobs to 4-GPU runners using a different label `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull` and `slow` (in addition to `inductor-rocm`) as well (although this currently will not change anything, since all our MI2xx runners have both `linux.rocm.gpu` and `linux.rocm.gpu.2` labels... but this will change in the future: see next point)
* Continue to use `linux.rocm.gpu` label for any job that doesn't need more than 1-GPU eg. binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
2024-12-24 08:04:00 +00:00
1963fc83a1 [micro_pipeline_tp] don't pass return_A to fused_all_gather_scaled_matmul (#143782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143782
Approved by: https://github.com/tianyu-l
2024-12-24 07:25:38 +00:00
ad750ae320 [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR adds functional support for max-autotune on XPU. The current Triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in follow-up PRs.
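For context, max-autotune is requested through the compile mode; a minimal sketch (the `xpu` device string assumes an XPU-enabled build):

```python
import torch

def mm(a, b):
    return a @ b

compiled = torch.compile(mm, mode="max-autotune")
# On an XPU-enabled build:
# x = torch.randn(512, 512, device="xpu")
# print(compiled(x, x).shape)
```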

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-24 05:42:36 +00:00
b0c3f48a40 [inductor] Improve error message for assert_size_stride (#143765)
```
>>> torch._C._dynamo.guards.assert_size_stride(torch.randn(10), (10,), (2,))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: expected size 10==10, stride 1==2 at dim=0
This error most often comes from an incorrect meta function for a custom op.
See https://pytorch.org/docs/stable/library.html#torch.library.opcheck
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143765
Approved by: https://github.com/zou3519
2024-12-24 05:26:05 +00:00
ace645a017 Add support for prototype affine quantization in pt2e flow (#141421)
Summary:
Duplicated affine quantization functionality, including the observer (https://github.com/pytorch/ao/blob/main/torchao/quantization/observer.py) and some quant_primitive ops (7c3c51fd0d/torchao/quantization/quant_primitives.py (L26-L30)), to allow a per-group min/max quantization observer in the pt2e flow.

Next: We can follow up to add moving average min max observer

Test Plan:
python test/test_quantization.py -k test_channel_group_quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141421
Approved by: https://github.com/cccclai
2024-12-24 04:22:18 +00:00
60a0d53c13 [dynamo] Add test for #143697 (#143764)
The issue from #143697 seems to already be fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143764
Approved by: https://github.com/Skylion007
2024-12-24 03:50:15 +00:00
01d60bcf32 [Easy] Fix todo by enable tests for cuda (#143637)
Fix TODO in `test_tensor_creation_ops.py` file:

```python
# TODO: update to work on CUDA, too
```

**Test Result**

```bash
$ pytest test/test_tensor_creation_ops.py
```

![image](https://github.com/user-attachments/assets/ef829541-668e-446d-a9ab-b26b9d73085f)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/d6a46eee-1f60-48e6-898a-a8d9620eb54a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143637
Approved by: https://github.com/albanD
2024-12-24 03:47:43 +00:00
b90a3b7281 [cumsum][CUDA][64-bit indexing] Add 64-bit indexing path for cumsum (#143696)
For #143486

Interestingly enough, changing the indexing type seems to degrade performance when the larger width is not needed, even on small sizes, so this is made a template parameter rather than forcing all cases to 64-bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143696
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-24 03:45:28 +00:00
dec4286b2d [inductor] Fix for extract_target with dots (#143766)
Fixes #143650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143766
Approved by: https://github.com/yanboliang
2024-12-24 03:42:15 +00:00
cyy
1feae27ed6 [16/N] Fix extra warnings brought by clang-tidy-17 (#143714)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143714
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 03:29:38 +00:00
49fdc52fd2 Revert "Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)"
This reverts commit bc78b6ea4f88d673426d6de17671b82facf50beb.

Reverted https://github.com/pytorch/pytorch/pull/143261 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint, plz help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/143261#issuecomment-2560583332))
2024-12-24 03:15:38 +00:00
cyy
d6a066ead6 Simplify host_softmax (#143251)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143251
Approved by: https://github.com/albanD
2024-12-24 02:27:51 +00:00
da21fabf34 [BE] Only print MKL version on x86 platforms (#143763)
As it will obviously be missing on ARM/S390, etc

Test plan: run `python3 -c "import torch;print(torch.__config__.parallel_info())"` on both x86 and non-x86 system
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143763
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 02:04:26 +00:00
7d1c666139 [dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #143722
2024-12-24 02:00:18 +00:00
fe95cbe018 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call an overridden keys method, so the C++ guard can use `PyDict_Next` directly to check the guards.
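
A generic illustration of why `dict.keys(d)` is the safer call here (illustrative subclass, not the actual guard code):

```python
class LoudDict(dict):
    def keys(self):  # user override that Dynamo must not rely on for guards
        return reversed(list(super().keys()))

d = LoudDict(a=1, b=2)
print(list(d.keys()))      # goes through the override: ['b', 'a']
print(list(dict.keys(d)))  # bypasses the override:     ['a', 'b']
```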

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-24 02:00:18 +00:00
67355a1289 [Easy] Add torch.range, torch.arange params optional description (#143731)
Fixes #129333

**Test Result**

**Before**

![image](https://github.com/user-attachments/assets/c5873690-7de7-4a14-9423-a150d17d137e)

![image](https://github.com/user-attachments/assets/ff4ee545-f27a-403b-bf92-51f9571022a3)

**After**

![image](https://github.com/user-attachments/assets/34e2c41f-8b54-417d-bb10-7ca6f679206a)

![image](https://github.com/user-attachments/assets/b54bcebd-70e9-4a1a-8a22-1ab815e17827)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143731
Approved by: https://github.com/janeyx99
2024-12-24 01:29:24 +00:00
0ca6a47872 Update tag_regex in filter_test_configs.py for workflows such as inductor-rocm (#143768)
This helps to make `continue-through-error`/`keep-going` work as expected on `inductor-rocm` workflow jobs.

Without this, the code here doesn't enter the `if` condition: 6ccb8ed186/.github/scripts/filter_test_configs.py (L577)

Tested via [this PR](https://github.com/pytorch/pytorch/pull/140989):
Without this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=8232e18957f987d99c946efc0cf6da9be9b52067: https://github.com/pytorch/pytorch/actions/runs/12164558045/job/34192442187#step:13:144

With this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=763179c5e421791ee05c8e2a600379b29a1c8c33: https://github.com/pytorch/pytorch/actions/runs/12261943684/job/34213300153#step:13:145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143768
Approved by: https://github.com/huydhn
2024-12-24 00:50:14 +00:00
bc78b6ea4f Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)
Fixes #143071

Operations performed on tensors with `requires_grad=True` such as
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.

While an operation using `numpy` like
```python
import numpy as np

x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.

However, an operation that uses `math` like
```python
import math

x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!

This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.

To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```

Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD
2024-12-24 00:22:18 +00:00
6ccb8ed186 Refactor AdamW into Adam (heavily inspired by tfsingh) (#143710)
Fixes #104899

Refactors AdamW into Adam by making AdamW a subclass of Adam. Additionally adds a test to assert that the added parameter `decoupled_weight_decay` is True in AdamW and also updates test_defaults_changed_to_foreach to account for the differences in module location for AdamW.
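
A minimal sketch of the subclassing idea (simplified and hypothetical; the real `torch.optim` implementation carries the full Adam signature and validation):

```python
from torch.optim import Adam

class AdamW(Adam):
    def __init__(self, params, lr=1e-3, weight_decay=1e-2, **kwargs):
        # AdamW is Adam with decoupled weight decay (Loshchilov & Hutter);
        # the shared step logic lives in Adam and is selected by this flag.
        super().__init__(params, lr=lr, weight_decay=weight_decay,
                         decoupled_weight_decay=True, **kwargs)
```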

Heavily heavily inspired by #118857 by @tfsingh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143710
Approved by: https://github.com/janeyx99
2024-12-23 23:27:28 +00:00
4271a95590 [logging] A few fixes/updates to record_compilation_metrics (#143332)
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics

Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
2024-12-23 23:10:11 +00:00
2ab698e708 allow profiling on all threads via experimentalConfig (#143659)
In some situations we want to profile calls coming from all threads (similar to on-demand), not just the thread that started profiling and the spawned threads that would inherit KinetoThreadLocal state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143659
Approved by: https://github.com/sraikund16
2024-12-23 20:41:27 +00:00
00831f9b22 [BE]: Properly forward raise pickle exception with from (#143761)
Properly raises the pickle exception with `from`. This provides a more informative stack trace and forwards information about the exception that led to the current one.
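
A generic example of the `raise ... from ...` pattern being applied (illustrative code, not the actual PyTorch call site):

```python
import pickle

def load_payload(data: bytes):
    try:
        return pickle.loads(data)
    except pickle.UnpicklingError as e:
        # Chaining with `from` attaches the original error as __cause__, so the
        # traceback shows both exceptions instead of hiding the root cause.
        raise RuntimeError("failed to deserialize payload") from e
```
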
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143761
Approved by: https://github.com/XuehaiPan, https://github.com/albanD
2024-12-23 20:21:30 +00:00
75e1f8a227 [ROCm] upgrade nightly wheels to rocm6.3 - 2 of 2 (binaries) (#143613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143613
Approved by: https://github.com/jeffdaily
2024-12-23 19:47:30 +00:00
0ebc6388cf Revert "Exclude py 31.3t triton package from PyTorch 3.13t wheel (#143218)"
This reverts commit 3bfdf6f0633e6feb067e032009256c740a2a2665.

Reverted https://github.com/pytorch/pytorch/pull/143218 on behalf of https://github.com/atalman due to this constrain is ignored see https://github.com/pytorch/pytorch/issues/143654 ([comment](https://github.com/pytorch/pytorch/pull/143218#issuecomment-2560208992))
2024-12-23 19:37:35 +00:00
727ee853b4 Apply TorchFix TOR203 fixes (#143691)
Codemodded via `torchfix . --select=TOR203 --fix`.
This is a step to unblock https://github.com/pytorch/pytorch/pull/141076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143691
Approved by: https://github.com/malfet
2024-12-23 18:21:03 +00:00
c042c8a475 Use default_collate from public API (#143616)
Codemodded via `torchfix . --select=TOR104 --fix`.
This is a step to unblock https://github.com/pytorch/pytorch/pull/141076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143616
Approved by: https://github.com/malfet
2024-12-23 17:38:43 +00:00
a70191da41 Add torch.topk indices vary description (#143736)
Fixes #133542

**Test Result**

**Before**

![image](https://github.com/user-attachments/assets/65227efb-02af-45e7-804c-35588dff360d)

**After**

![image](https://github.com/user-attachments/assets/91f1f53f-008c-4784-82fe-013404e273cb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143736
Approved by: https://github.com/zou3519
2024-12-23 17:16:31 +00:00
1519a9e30b Revert "Add FP8 support for eye (#139974)"
This reverts commit 01890526b9068ae20b38b2a33e8f11a6331d7d4b.

Reverted https://github.com/pytorch/pytorch/pull/139974 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some slow tests ([comment](https://github.com/pytorch/pytorch/pull/139974#issuecomment-2560046399))
2024-12-23 17:12:39 +00:00
12662901aa [BE] Move Mac BB test to its own step (#143513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143513
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere
ghstack dependencies: #143395, #143511, #143512
2024-12-23 14:05:10 +00:00
5c4545f857 [BE][Easy] enable PYFMT for torch/[a-s]*/ (#138447)
Reproduce command:

```bash
ghstack checkout https://github.com/pytorch/pytorch/pull/138447
git checkout HEAD~1 torch/
lintrunner -a --take "PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138447
Approved by: https://github.com/ezyang
2024-12-23 14:04:00 +00:00
7314cf44ae torch/accelerator: fix device type comparison (#143541)
This was failing without the fix:
```
python -c 'import torch; d=torch.device("xpu:0"); torch.accelerator.current_stream(d)'
```
with:
```
ValueError: xpu doesn't match the current accelerator xpu.
```

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143541
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-23 10:54:53 +00:00
434e0c2104 Inductor Cutlass backend: Eliminate unused code. (#143723)
Summary: Eliminates an unused file and some smaller unused code fragments from the inductor cutlass codebase.

Test Plan: CI

Differential Revision: D67579837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143723
Approved by: https://github.com/ColinPeppler
2024-12-23 09:35:03 +00:00
01890526b9 Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-23 06:47:49 +00:00
448c16ac87 Revert "[reland][AMD] Turn on TF32 for aten::mm (#143549)"
This reverts commit 41cdc7f73552cc8a0dbf2d3cb55440c0d6b548ea.

Reverted https://github.com/pytorch/pytorch/pull/143549 on behalf of https://github.com/malfet due to It breaks ROCM testing, see 06b4b96b34/1 ([comment](https://github.com/pytorch/pytorch/pull/143549#issuecomment-2559016960))
2024-12-23 06:47:36 +00:00
06b4b96b34 dynamo tracing perf: no re in arg_ref: 33.9 -> 33.7 (#143069)
See #143056 for overall docs.

This PR: Avoid use of python re and move valid varname check in
`GuardBuilder.arg_ref()` into C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143069
Approved by: https://github.com/jansel
2024-12-23 05:32:09 +00:00
07fa6e2c8b Fix torch.accelerator api abort when passing invaild device (#143550)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543

# Solution
We should raise a Python exception instead of aborting.

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
2024-12-23 03:44:22 +00:00
eebc93d41e Better fix for f-strings in set_linter for py3.12 (#143725)
#143628 didn't handle a few cases correctly, for example:
```py
$ python3 tools/linter/adapters/set_linter.py torch/_inductor/scheduler.py
torch/_inductor/scheduler.py:261:24: Builtin `set` is deprecated
  259 |                 multiline=False,
  260 |             )
  261 |         return f"{self}{data_str}"
                               ^
  262 |
  263 |     def log_details(self) -> None:

torch/_inductor/scheduler.py:261:33: Builtin `set` is deprecated
  259 |                 multiline=False,
  260 |             )
  261 |         return f"{self}{data_str}"
                                        ^
  262 |
  263 |     def log_details(self) -> None:
```
It also mishandled multi-line f-strings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143725
Approved by: https://github.com/yanboliang
2024-12-22 22:51:27 +00:00
41cdc7f735 [reland][AMD] Turn on TF32 for aten::mm (#143549)
Summary:
hipblaslt supports TF32, so adding the support.

Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67431681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143549
Approved by: https://github.com/eqy
2024-12-22 21:05:05 +00:00
6425f0779d [BE] Update triton repo link (#143429)
It should be https://github.com/triton-lang/triton rather than https://github.com/openai/triton shouldn't it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143429
Approved by: https://github.com/jansel
2024-12-22 18:38:35 +00:00
a316a4581d Add mps to GPU_TYPES (#143634)
Because it is a GPU, but don't require Triton for it, as it does not need one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143634
Approved by: https://github.com/jansel
2024-12-22 18:37:35 +00:00
cyy
09c950cc87 Remove unused <ATen/core/Array.h> inclusion (#143701)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143701
Approved by: https://github.com/albanD
2024-12-22 14:30:11 +00:00
dc55704b48 Rename cache limit to recompile limit in configs (#143709)
This PR renames every cache_limit to recompile_limit via sed.

Old config options are maintained via Config(alias='xyz')

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143709
Approved by: https://github.com/jansel
2024-12-22 10:03:57 +00:00
9bf4b1c2e9 dynamo tracing perf: c++ strip_function_call: 49.12 -> 47.77 (#143063)
See #143056 for overall docs.

This PR: Convert `strip_function_call()` into C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143063
Approved by: https://github.com/jansel
ghstack dependencies: #143057, #143062
2024-12-22 06:38:46 +00:00
3ec04d30d5 dynamo tracing perf: kill import: 50.36 -> 49.12 (#143062)
See #143056 for overall docs.

This PR: Stop importing in the body of `BuiltinVariable.call_getattr()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143062
Approved by: https://github.com/jansel
ghstack dependencies: #143057
2024-12-22 06:38:46 +00:00
f2b744b9ca dynamo tracing perf: import_module: 59.92 -> 52.9 (#143057)
See #143056 for overall docs.

This PR: Using `importlib.import_module()` within the hot path of
symbolic_convert is slow. Memoize it.
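
A minimal sketch of the memoization described above (the actual Dynamo change may differ in detail):

```python
import functools
import importlib

@functools.lru_cache(maxsize=None)
def cached_import_module(name: str):
    # importlib.import_module already consults sys.modules, but the lookup and
    # argument handling still cost time on a hot path; caching the wrapper
    # turns repeated calls into a single dictionary hit.
    return importlib.import_module(name)
```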

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143057
Approved by: https://github.com/jansel
2024-12-22 06:38:38 +00:00
f1cbf4b1b5 Enable ruff's unused variable checking everywhere in pytorch (#136965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-12-22 02:33:11 +00:00
2293fe1024 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-21 22:08:01 +00:00
197954e14b Revert "Handle meta tensors in FX quantization (#142262)"
This reverts commit e97b97af56204230f1030bd297dda9bc6b053a4c.

Reverted https://github.com/pytorch/pytorch/pull/142262 on behalf of https://github.com/janeyx99 due to this PR broke lint  ([comment](https://github.com/pytorch/pytorch/pull/142262#issuecomment-2558233022))
2024-12-21 20:34:09 +00:00
0666347fc4 [Codemod][AddExplicitStrictExportArg] caffe2/benchmarks/dynamo (#143686)
Reviewed By: avikchaudhuri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143686
Approved by: https://github.com/tugsbayasgalan
2024-12-21 19:56:56 +00:00
e97b97af56 Handle meta tensors in FX quantization (#142262)
Summary:
If the module being quantized contains some meta tensors and some tensors on an actual device, we should not fail quantization.

Quantization should also not fail if the new quantized module is created on a meta device.

Differential Revision: D66895899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142262
Approved by: https://github.com/iamzainhuda
2024-12-21 13:19:30 +00:00
cyy
daa3ffe0eb Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-21 09:19:02 +00:00
e15442a9b2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 6733045a4aaef7a8d9fb1f9f8b80f4f5f4ef1f4f.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but my first attempt to fix internal build does not fix all the cases, so let us try again ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2558043056))
2024-12-21 08:06:19 +00:00
51eacea8c4 graph module retracing without preserving MCS (#143676)
Retracing while preserving module call signatures used to be a problem because graph modules don't have submodules at given paths. This led to a number of failing retraceability tests. By not trying to wrap modules with export tracepoints, we can pass most of these tests; the only exception is module swapping on retraced programs, which is still not possible.

Differential Revision: [D67539304](https://our.internmc.facebook.com/intern/diff/D67539304/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143676
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
ghstack dependencies: #143664
2024-12-21 07:57:43 +00:00
cyy
d7e59c2f85 Fix cppcoreguidelines-pro-type-member-init (#141787)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141787
Approved by: https://github.com/albanD
2024-12-21 07:51:30 +00:00
7b2af25f80 [1/n] Support Dynamic Memory Budget in Auto AC (#143539)
# Summary:
Full Context: https://docs.google.com/document/d/1-j5KSbfGFJQcH4sYh7BIeJXso3zYzl5G5yFQqXdKx_o/edit?usp=sharing

tl;dr

This change introduces classes which help determine a dynamic memory budget. This will mostly be helpful for models with many implicit graph breaks.

---

New Classes:

*GraphInfoProvider*
* Takes the joint_graph as well as the input memories and runtimes and parses the graph + values into usable forms for the SolverEvaluator.

*KnapsackEvaluator*
* Provides a function: given the four inputs (the solver function as a callable, max_dynamic_memory_budget, min_dynamic_memory_budget, and dynamic_memory_budget_pareto_granularity), it returns an approximation of the knee point of the Pareto distribution.

# Test Plan:

### LintRunner

LintRunner Output: P1700445547

### Unit Tests

```
$ buck test @mode/opt //caffe2/test/functorch:test_ac_knapsack
`@mode/opt` was specified, but not found. Using file at `//mode/opt`.
This behavior is being deprecated. Please use `"@//mode/opt"` instead
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpB6PmDS
File changed: fbsource//xplat/caffe2/test/functorch/test_ac_knapsack.py
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpyjCiPn
20 additional file change events
Buck UI: https://www.internalfb.com/buck2/414ead46-9ede-4192-8e1a-5d3c52bdb9cc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924710342830
Network: Up: 0B  Down: 0B  (reSessionID-159794b9-9d61-477e-8e63-9bdeaa537dca)
Analyzing targets. Remaining     0/214
Executing actions. Remaining     0/6933                                                                                                                                                                                  0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 18.5s
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

### Test Run

Updated the config:

```
      activation_memory_budget_solver: DYNAMIC_MEMORY_BUDGET_DP
```

Confirming proper execution via: [aps-fb_fm_v4_768_01_dynamic-2a792ba8af](https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-fb_fm_v4_768_01_dynamic-2a792ba8af?job_attempt=0&version=0&env=PRODUCTION)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143539
Approved by: https://github.com/jansel
2024-12-21 07:38:52 +00:00
bee47b0663 Revert "[pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430)"
This reverts commit 33dd4f187dd3b54d65182d56998feae235ee48c7.

Reverted https://github.com/pytorch/pytorch/pull/143430 on behalf of https://github.com/huydhn due to The internal diff D58707846 has been backed out ([comment](https://github.com/pytorch/pytorch/pull/143430#issuecomment-2558033930))
2024-12-21 07:26:34 +00:00
47c4e01e71 [audio hash update] update the pinned audio hash (#143694)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143694
Approved by: https://github.com/pytorchbot
2024-12-21 05:42:34 +00:00
9f3c291bc3 Fix issue with setAttribute and int8_t vs int32_t variables (#143693)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693
Approved by: https://github.com/huydhn
2024-12-21 05:31:56 +00:00
518b5050c0 Fix unused-variable issues in caffe2 (#143639)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/cyyever
2024-12-21 05:27:38 +00:00
f44310097c Reuse partial reductions (#143600)
Reuse partial reductions for complete reductions. We could expand this to cover more types of reductions, although we'd have to be a bit more careful about keeping the intermediary partial reduction in higher precision.
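
A user-level illustration of the reuse opportunity (the fusion itself happens inside Inductor; `amax` is just an example of a reduction that does not need a higher-precision accumulator):

```python
import torch

@torch.compile
def f(x):
    per_row = x.amax(dim=-1)  # partial (per-row) reduction
    total = x.amax()          # complete reduction; can be computed from per_row
    return per_row, total
```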

Just doing the ops which do not depend on a higher compute_dtype_precision for now to cover the relevant use case initially.

Fix for https://github.com/pytorch/pytorch/issues/136267. Longer term, we should make sure cooperative reductions fuse partial and complete reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143600
Approved by: https://github.com/vkuzo
2024-12-21 04:44:07 +00:00
97990f476d Revert "Fix unused-variable issues in caffe2 (#143639)"
This reverts commit 23ca7c2515dd1f601926c4fd0e65513308c135a9.

Reverted https://github.com/pytorch/pytorch/pull/143639 on behalf of https://github.com/huydhn due to This is failing OSS tests ([comment](https://github.com/pytorch/pytorch/pull/143639#issuecomment-2557991297))
2024-12-21 04:30:48 +00:00
b89bfe0bac Revert "Fix issue with setAttribute and int8_t vs int32_t variables (#143693)"
This reverts commit ae3d385fcba0f91f35b2848b852d4c75f88cbd62.

Reverted https://github.com/pytorch/pytorch/pull/143693 on behalf of https://github.com/huydhn due to Sorry for reverting this change but it has a conflict with https://github.com/pytorch/pytorch/pull/143639 that is breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/143693#issuecomment-2557990508))
2024-12-21 04:27:18 +00:00
a8953c36f5 [compiled autograd] log compilation time to perfetto (#140964)
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmprli4iy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```
[
  {
    "args": {
      "compile_id": "0/-/-",
      "graph_id": 0
    },
    "cat": "dynamo_timed",
    "name": "compiled_autograd",
    "ph": "B",
    "pid": 0,
    "tid": 0,
    "ts": 1733886868992655.8
  },
  {
    "args": {
      "compile_id": "0/-/-",
      "graph_id": 0
    },
    "cat": "dynamo_timed",
    "name": "compiled_autograd",
    "ph": "E",
    "pid": 0,
    "tid": 0,
    "ts": 1733886869130681.0
  },
  {
    "args": {
      "compile_id": "0/0/0"
    },
    "cat": "dynamo_timed",
    "name": "dynamo",
    "ph": "B",
    "pid": 0,
    "tid": 0,
    "ts": 1733886869134350.5
  },
  {
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140964
Approved by: https://github.com/masnesral
ghstack dependencies: #141907, #143175
2024-12-21 04:23:25 +00:00
c7d7eff798 Revert "[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)"
This reverts commit efe21ee59dfdd6642cc693e69e07aa9d8be13eb9.

Reverted https://github.com/pytorch/pytorch/pull/143347 on behalf of https://github.com/huydhn due to D67118173 has been backed out internally ([comment](https://github.com/pytorch/pytorch/pull/143347#issuecomment-2557983266))
2024-12-21 04:04:16 +00:00
dabc9566c4 Revert "(MTIA) Move "empty_cache" API (#143402)"
This reverts commit c7d9f298072a3f59b39517e367c7d3d2ea30e6d9.

Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))
2024-12-21 04:01:23 +00:00
fecf03fa3f [AOTI][reland] Emit a CMakeLists.txt when package_cpp_only (#143680)
Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping the AOTI-generated .pt2 package file, users can manually build the generated model code in their local environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143680
Approved by: https://github.com/huydhn
2024-12-21 03:48:40 +00:00
b5e159270a [AOTI XPU] Replace intel compiler with g++ to build inductor CPP wrapper in runtime. (#142322)
This PR removes the dependency on the Intel compiler at Inductor runtime. Now we only need SYCL_HOME at runtime to find the SYCL headers and libs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142322
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/albanD
ghstack dependencies: #143491
2024-12-21 02:27:04 +00:00
af0e159740 [Inductor XPU] Add XPU check for is_big_gpu(). (#143491)
Fix #143472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143491
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-21 02:27:04 +00:00
0da004f3dd [dynamo] Remove transformers ModelOutput hack (#143567)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143567
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143548
2024-12-21 01:46:14 +00:00
4627cfd1f9 [dynamo] Support user defined dicts (#143548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143548
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/williamwen42
2024-12-21 01:46:14 +00:00
9cb743d1f9 [easy] Set feature use for aot autograd remote cache (#143674)
Use set_feature_use for logging aot autograd cache so that dynamo_compile has this data as well as PT2 Compile Events.

Differential Revision: [D67536293](https://our.internmc.facebook.com/intern/diff/D67536293/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143674
Approved by: https://github.com/bobrenjc93
2024-12-21 01:40:18 +00:00
ffd1b53f26 [aot] refactor dynamo source and cudagraphs static idx logic (#141748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141748
Approved by: https://github.com/ezyang
2024-12-21 01:20:53 +00:00
ae3d385fcb Fix issue with setAttribute and int8_t vs int32_t variables (#143693)
Test Plan: Sandcastle

Differential Revision: D67549758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693
Approved by: https://github.com/huydhn
2024-12-21 01:19:29 +00:00
bdeee82822 unflatten isinstance (#143664)
When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we do carry in `nn_module_stack` entries.
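
A hypothetical usage sketch based on this description (the model and submodule names are made up):

```python
import torch
from torch import nn

class Block(nn.Module):
    def forward(self, x):
        return x + 1

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = Block()

    def forward(self, x):
        return self.block(x)

ep = torch.export.export(Model(), (torch.randn(2),))
unflat = torch.export.unflatten(ep)
sub = unflat.get_submodule("block")
print(isinstance(sub, Block))  # False: sub is an InterpreterModule
print(sub.type_name())         # "Block", per the description above
```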

Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664
Approved by: https://github.com/tugsbayasgalan
2024-12-21 01:07:10 +00:00
d88ebbf822 cleanup chromium event log on dynamo exit rather than on entry (#143175)
clearing at dynamo start is an issue because it throws away events from compiled autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143175
Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
ghstack dependencies: #141907
2024-12-21 00:41:24 +00:00
4ee166b82f [ca] add compiled autograd to CompileId (#141907)
tlparse PR: https://github.com/ezyang/tlparse/pull/83

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141907
Approved by: https://github.com/ezyang
2024-12-21 00:41:24 +00:00
0ce233b8ca Support tensor subclass unwrapping (#141941)
This PR adds support for export to unwrap/wrap subclasses AOT so that we can trace through subclass parameters. This will resolve the UX issue in torchao where users had to manually unwrap their subclasses before calling export.

Differential Revision: [D67531057](https://our.internmc.facebook.com/intern/diff/D67531057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141941
Approved by: https://github.com/bdhirsh
2024-12-21 00:29:31 +00:00
553031fb9a [BE] Remove gcc-5 workaround for unused args (#143685)
ditto

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143685
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/atalman
2024-12-21 00:18:15 +00:00
ad7ab5ef84 Revert "[logging] A few fixes/updates to record_compilation_metrics (#143332)"
This reverts commit a9c753bbc88bfdc0e77f66956b3a11e405235d0f.

Reverted https://github.com/pytorch/pytorch/pull/143332 on behalf of https://github.com/malfet due to Surprisingly failure is caused by this PR ([comment](https://github.com/pytorch/pytorch/pull/143332#issuecomment-2557899120))
2024-12-21 00:06:44 +00:00
bf7009d839 [rpc] Fix unit test after c10::nullopt removal (#143690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143690
Approved by: https://github.com/yifuwang, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-12-20 23:36:07 +00:00
eqy
912d6a2867 [CUDA] Bump tolerances in test_svd_lowrank_cuda_float64 (#143049)
pre-emptive bump for apparent noisy failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143049
Approved by: https://github.com/Skylion007, https://github.com/lezcano, https://github.com/nikitaved
2024-12-20 23:25:21 +00:00
8960cb5809 Add support for bfloat16 atomic adds in fbcode (#143629)
Reland https://github.com/pytorch/pytorch/pull/141857 and fall back on A100, which doesn't have bfloat16 atomic add instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143629
Approved by: https://github.com/eellison
2024-12-20 23:05:13 +00:00
a3b04d473e [ROCm] Update setup-rocm for almalinux-based images (#143590)
Needed for https://github.com/pytorch/test-infra/pull/6003 and https://github.com/pytorch/ao/pull/999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143590
Approved by: https://github.com/atalman

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-20 22:48:54 +00:00
23ca7c2515 Fix unused-variable issues in caffe2 (#143639)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-20 22:30:58 +00:00
6e58c37542 c10d: no call_guard in init (#143598)
`py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe.

For more details see #143593

https://github.com/pybind/pybind11/issues/5473

Test plan:

```
python setup.py develop
```

CI

```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist

def run():
    store = dist.TCPStore(
        host_name="localhost",
        port=0,
        is_master=True,
        wait_for_workers=False,
    )

    # this sleep is required to trigger the crash
    time.sleep(0.1)
    del store

futures = []
with ThreadPoolExecutor(
    max_workers=100,
) as executor:
    for i in range(100000):
        print(i)
        futures.append(executor.submit(run))
        if len(futures) > 100:
            futures.pop(0).result()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598
Approved by: https://github.com/c-p-i-o
2024-12-20 22:23:36 +00:00
a9c753bbc8 [logging] A few fixes/updates to record_compilation_metrics (#143332)
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics

Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
2024-12-20 21:42:32 +00:00
372b023eb1 Fix test_serialization_zipfile_actually_jit when weights_only is not default (#143668)
Fails in fbcode where weights_only isn't default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143668
Approved by: https://github.com/awgu
ghstack dependencies: #143326, #143403
2024-12-20 21:25:10 +00:00
33dd4f187d [pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430)
The resources directory lets ET observer dump any additional data like Triton kernels while capturing the ET.

This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data etc.

We also added a few ways to enable ET and ET Resources through the OS environment variables.

Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in Pytorch.

Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels.

Differential Revision: [D58707846](https://our.internmc.facebook.com/intern/diff/D58707846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143430
Approved by: https://github.com/shengfukevin, https://github.com/sraikund16
2024-12-20 21:20:32 +00:00
cee06e74ee Apply clang-format for ATen/core/dispatch headers (#143620)
Code change via add path config in `.lintrunner.toml` file and running

```bash
 $ lintrunner -a --take CLANGFORMAT --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143620
Approved by: https://github.com/malfet
2024-12-20 21:16:23 +00:00
8e483654cb Add config.save.use_pinned_memory_for_d2h to serialization config (#143342)
This was benchmarked with two separate scripts on my A100
(A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save``
(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save`
Timings are an average of 5 runs and benchmark scripts + results are attached

Under both scenarios, we see **~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``)** compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False``

(A)  Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 28.54s | 20.76s |
| `compute_crc_32 = False` | 22.57s |  **14.51s** |

(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 8.38s | 5.53s |
| `compute_crc_32 = False` | 6.94s |  **3.99s** |

Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False`
<img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" />

Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True`
<img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342
Approved by: https://github.com/albanD
ghstack dependencies: #143324
2024-12-20 21:01:18 +00:00
3f63b742e6 Refactor serialization getter/setters into torch.utils.serialization.config (#143324)
Consolidate
- get/set_default_load_endianness
- get/set_default_mmap_options
- get/set_crc32_options

into one global dynamo-style config + allow global setting of mmap. The existing APIs are not removed and will get/set from the config (as they can't be removed for BC)
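
A hedged usage sketch of the consolidated config, combining this commit with the pinned-memory option from the commit above (attribute names inferred from the commit messages; check `torch.utils.serialization.config` for the exact spelling):

```python
import torch
from torch.utils.serialization import config

config.save.compute_crc32 = False             # skip CRC32 computation on save
config.save.use_pinned_memory_for_d2h = True  # stage CUDA tensors through pinned memory
config.load.mmap = True                       # global opt-in to mmap'd loading

torch.save({"w": torch.randn(4)}, "ckpt.pt")
state = torch.load("ckpt.pt")
```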

In #143459 I add the local (argument style) config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324
Approved by: https://github.com/albanD
2024-12-20 21:01:17 +00:00
629de988df Fix old-compiler-unfriendly zero init of bfloat16_t array (#143504)
clang versions before 17 don't like to assign 0 to a bfloat16_t. gcc versions before 13 also won't assign 0.0 to a bfloat16_t. (Citation: https://godbolt.org/z/Gzs5ebdej)

Differential Revision: [D67396740](https://our.internmc.facebook.com/intern/diff/D67396740/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143504
Approved by: https://github.com/malfet
2024-12-20 20:49:51 +00:00
485497e727 [c10d][fr] flight recorder improvements (#143446)
Summary:
1. Flight recorder traces are now dumped automatically by default upon
   timeout or exception. Users don't need to opt in.
2. Change default dump location to running user's home directory
   `.cache` folder.

Test Plan:
1. Tested locally by running the crash program from flight recorder
   tutorial page.
   https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html#an-end-to-end-example
2. Noted that flight recorder files were correctly created.
❯ pwd
/home/cpio/.cache/fr_trace
❯ ls
nccl_trace_rank_0  nccl_trace_rank_1

Differential Revision: [D67363720](https://our.internmc.facebook.com/intern/diff/D67363720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143446
Approved by: https://github.com/d4l3k
2024-12-20 20:41:30 +00:00
a94f259a69 pgo: Log feature use (#142819)
This will cause dynamo_compile to populate the feature column if we have
a hit for PGO.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142819
Approved by: https://github.com/ezyang
2024-12-20 20:22:20 +00:00
8ce0bc282a dynamo tracing perf: bytecode_transform improvements: 34.86 -> 33.9 (#143068)
See #143056 for overall docs.

This PR: Use slots on InstructionExnTabEntry and Instruction.  Stop doing python
version checks in the middle of `convert_instruction()` and
`inst_has_op_bits()`.
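
A generic illustration of the `__slots__` part of the change (field names are illustrative, not the actual `bytecode_transformation` definitions):

```python
class Instruction:
    # __slots__ removes the per-instance __dict__, cutting memory use and
    # speeding up attribute access for objects created in bulk during tracing.
    __slots__ = ("opcode", "opname", "arg", "argval")

    def __init__(self, opcode, opname, arg=None, argval=None):
        self.opcode = opcode
        self.opname = opname
        self.arg = arg
        self.argval = argval
```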

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143068
Approved by: https://github.com/jansel
ghstack dependencies: #143065, #143067
2024-12-20 20:06:42 +00:00
5feb2d7b41 dynamo tracing perf: don't call expensive _set_guard_export_info if it's a duplicate guard: 37.66 -> 34.86 (#143067)
See #143056 for overall docs.

This PR: Move the call to `_set_guard_export_info()` after the duplicate guard
check in `GuardBuilder.DUPLICATE_INPUT()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143067
Approved by: https://github.com/jansel
ghstack dependencies: #143065
2024-12-20 20:06:42 +00:00
7d4e7fbfc1 dynamo tracing perf: no import on hot path: 47.62 -> 47.26 (#143065)
See #143056 for overall docs.

This PR: Removed another `import` in the body of the hot path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143065
Approved by: https://github.com/jansel
2024-12-20 20:06:42 +00:00
792e6184c5 [GPT-fast] Support run spcific model or micro-benchmark (#143607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143607
Approved by: https://github.com/BoyuanFeng, https://github.com/jerryzh168, https://github.com/huydhn
2024-12-20 19:58:07 +00:00
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
b5475d334e [inductor] Fix an unused variable in cpu_vec_isa.py (#138473)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138473
Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/xuhancn
2024-12-20 18:50:19 +00:00
5a69c2a649 [BE][Sparse] Get rid of gcc-5 workaround (#143653)
Discovered those comments while looking at https://github.com/pytorch/pytorch/pull/143620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143653
Approved by: https://github.com/albanD
2024-12-20 18:40:45 +00:00
a5ed499f6a FlexAttention Benchmark (#139665)
1. Add alibi, sliding window, tanh softcap, prefixLM, and document_mask from attn_gym to the benchmark.

2. Add comparison to different SDPA backends & FAv2, FAv3, FAKV.

Dependent on https://github.com/pytorch/pytorch/pull/139639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139665
Approved by: https://github.com/drisspg
2024-12-20 17:52:24 +00:00
c7d9f29807 (MTIA) Move "empty_cache" API (#143402)
Summary: This diff moves one of memory-related APIs to the consolidated location, which is `mtia/memory.py`.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api
```

https://www.internalfb.com/intern/testinfra/testrun/13510798943184259

Reviewed By: nautsimon

Differential Revision: D67148738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402
Approved by: https://github.com/nautsimon
2024-12-20 17:39:06 +00:00
d79fbf6b6d test/dynamo/test_utils: logging - Stop testing for impossible things. (#143535)
We don't support assigning to objects or numeric constants at the top level in
config modules, so there is no need to test for them.

(This specifically breaks later sorting refactoring, since it requires <
to be implemented).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143535
Approved by: https://github.com/ppanchalia
2024-12-20 17:21:49 +00:00
f5af87c23c Make Inductor cpp backend enable_floating_point_contract_flag to take string (#143450)
Differential Revision: D66269001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143450
Approved by: https://github.com/desertfire
2024-12-20 16:28:54 +00:00
7ab880bc5e fix typo in autocast header (#143625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143625
Approved by: https://github.com/mlazos
ghstack dependencies: #143592
2024-12-20 16:17:15 +00:00
4f8b7c4272 Revert "refactor tensorify restart logic to use sources (#141517)" (#143623)
This reverts commit 30d8b30db7eaaa254d97077ac6515cdc4568fd6d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143623
Approved by: https://github.com/mlazos
2024-12-20 15:38:34 +00:00
607884c9af [Inductor][CPP] Fix bitwise shift with corner inputs (#143635)
**Summary**
Fix issues https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566 by aligning the implementation with Eager (29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)) for these corner inputs.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635
Approved by: https://github.com/jgong5
2024-12-20 13:47:40 +00:00
7bf3b7cdc5 Rewrite _reparametrize_module to use contextmanager (#138203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203
Approved by: https://github.com/zou3519
ghstack dependencies: #136033, #140604
2024-12-20 12:02:27 +00:00
1c817fe671 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2024-12-20 12:02:27 +00:00
673cc88fd6 Add support for contextmanager in Dynamo (#136033)
Fixes #130559

* Intro

This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).

* Motivation

Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:

```python
import torch
import contextlib

@contextlib.contextmanager
def set_default_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(old_dtype)

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with set_default_dtype(torch.float64):
        x = torch.tensor([3.0, 3.0 + 5.0j])
    return x
```

Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.

* List of changes

`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.

A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.

* Corner cases

Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.
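
As a hedged sketch (reusing `set_default_dtype` from the example above), returning the context manager object itself from a compiled function is expected to raise `IncorrectUsage`:

```python
@torch.compile(backend="eager", fullgraph=True)
def make_cm():
    # Returning the context manager itself, rather than entering it in a
    # `with` block, is the unsupported pattern described above.
    return set_default_dtype(torch.float64)
```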

Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.

* This PR is breaking my code

There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing context managers. If this still causes crashes,
please revert this PR.
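
A hedged sketch of turning the tracing off, assuming the flag lives on `torch._dynamo.config` under the name given above:

```python
import torch._dynamo

# Fall back to the previous behavior: skip the context-manager frame and run it eagerly.
torch._dynamo.config.enable_trace_contextlib = False
```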
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
2024-12-20 12:02:20 +00:00
04b26ee1e8 Fix false positive from f-strings in set_linter (#143628)
This linter was going crazy on Python 3.12. Example:
```py
$ python3 tools/linter/adapters/set_linter.py torch/_inductor/runtime/triton_heuristics.py
torch/_inductor/runtime/triton_heuristics.py:192:25: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:27: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                  ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:29: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                    ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:31: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                      ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:195:17: Builtin `set` is deprecated
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
                        ^
  196 |         f.write(f"{kernel_name} | {args_str}\n")
  197 |

torch/_inductor/runtime/triton_heuristics.py:195:26: Builtin `set` is deprecated
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
                                 ^
  196 |         f.write(f"{kernel_name} | {args_str}\n")
  197 |

torch/_inductor/runtime/triton_heuristics.py:196:19: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                          ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:31: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                      ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:35: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                          ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:44: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                                   ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:729:26: Builtin `set` is deprecated
  727 |         exec(
  728 |             f"""
  729 |             def launcher({', '.join(def_args)}, grid, stream):
                                 ^
  730 |                 if callable(grid):
  731 |                     grid_0, grid_1, grid_2 = grid(grid_meta)

torch/_inductor/runtime/triton_heuristics.py:729:46: Builtin `set` is deprecated
  727 |         exec(
  728 |             f"""
  729 |             def launcher({', '.join(def_args)}, grid, stream):
                                                     ^
  730 |                 if callable(grid):
  731 |                     grid_0, grid_1, grid_2 = grid(grid_meta)

torch/_inductor/runtime/triton_heuristics.py:735:24: Builtin `set` is deprecated
  733 |                     grid_0, grid_1, grid_2 = grid
  734 |
  735 |                 args = {', '.join(call_args)},
                               ^
  736 |                 launch_args = get_launch_args(
  737 |                     grid, grid_0, grid_1, grid_2, stream, function,

torch/_inductor/runtime/triton_heuristics.py:735:45: Builtin `set` is deprecated
  733 |                     grid_0, grid_1, grid_2 = grid
  734 |
  735 |                 args = {', '.join(call_args)},
                                                    ^
  736 |                 launch_args = get_launch_args(
  737 |                     grid, grid_0, grid_1, grid_2, stream, function,

torch/_inductor/runtime/triton_heuristics.py:1144:20: Builtin `set` is deprecated
 1142 |     cur_file = inspect.stack()[1].filename
 1143 |     summary_str = (
 1144 |         f"SUMMARY ({cur_file})\n"
                           ^
 1145 |         f"{overall_time:.2f}ms   \t {overall_gb:.2f} GB\t {overall_gb / (overall_time / 1e3):.2f}GB/s"
 1146 |     )

torch/_inductor/runtime/triton_heuristics.py:1144:29: Builtin `set` is deprecated
 1142 |     cur_file = inspect.stack()[1].filename
 1143 |     summary_str = (
 1144 |         f"SUMMARY ({cur_file})\n"
                                    ^
 1145 |         f"{overall_time:.2f}ms   \t {overall_gb:.2f} GB\t {overall_gb / (overall_time / 1e3):.2f}GB/s"
 1146 |     )

torch/_inductor/runtime/triton_heuristics.py:1162:61: Builtin `set` is deprecated
 1160 |                 )
 1161 |                 file.write("====================\n")
 1162 |                 file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n")
                                                                    ^
 1163 |                 for ms, num_gb, gb_per_s, kernel_name in sorted_calls:
 1164 |                     # also display the runtime percentage for each kernel

torch/_inductor/runtime/triton_heuristics.py:1162:70: Builtin `set` is deprecated
 1160 |                 )
 1161 |                 file.write("====================\n")
 1162 |                 file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n")
                                                                             ^
 1163 |                 for ms, num_gb, gb_per_s, kernel_name in sorted_calls:
 1164 |                     # also display the runtime percentage for each kernel

torch/_inductor/runtime/triton_heuristics.py:1166:36: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                           ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:47: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                      ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:52: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                           ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:64: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                                       ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1175:30: Builtin `set` is deprecated
 1173 |                     )
 1174 |                     file.write(bw_info_str + "\n")
 1175 |                 file.write(f"{summary_str}\n\n")
                                     ^
 1176 |         except Exception as e:
 1177 |             log.warning(

torch/_inductor/runtime/triton_heuristics.py:1175:42: Builtin `set` is deprecated
 1173 |                     )
 1174 |                     file.write(bw_info_str + "\n")
 1175 |                 file.write(f"{summary_str}\n\n")
                                                 ^
 1176 |         except Exception as e:
 1177 |             log.warning(

torch/_inductor/runtime/triton_heuristics.py:1205:29: Builtin `set` is deprecated
 1203 |         else:
 1204 |             possible_names = _find_names(self)
 1205 |             kernel_name = f"{max(possible_names, key=len)}"
                                    ^
 1206 |             if not re.match(self.regex_filter, kernel_name):
 1207 |                 return

torch/_inductor/runtime/triton_heuristics.py:1205:58: Builtin `set` is deprecated
 1203 |         else:
 1204 |             possible_names = _find_names(self)
 1205 |             kernel_name = f"{max(possible_names, key=len)}"
                                                                 ^
 1206 |             if not re.match(self.regex_filter, kernel_name):
 1207 |                 return

torch/_inductor/runtime/triton_heuristics.py:1241:60: Builtin `set` is deprecated
 1239 |                     "%s",
 1240 |                     create_bandwidth_info_str(
 1241 |                         ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}"
                                                                   ^
 1242 |                     ),
 1243 |                 )

torch/_inductor/runtime/triton_heuristics.py:1241:72: Builtin `set` is deprecated
 1239 |                     "%s",
 1240 |                     create_bandwidth_info_str(
 1241 |                         ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}"
                                                                               ^
 1242 |                     ),
 1243 |                 )

torch/_inductor/runtime/triton_heuristics.py:1256:15: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                      ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:42: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                 ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:44: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                   ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:58: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                 ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:60: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                   ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:75: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                                  ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1377:23: Builtin `set` is deprecated
 1375 |         if numel is None:
 1376 |             continue
 1377 |         block = cfg[f"{label}BLOCK"]
                              ^
 1378 |         if numel == 1:
 1379 |             assert block == 1, (

torch/_inductor/runtime/triton_heuristics.py:1377:29: Builtin `set` is deprecated
 1375 |         if numel is None:
 1376 |             continue
 1377 |         block = cfg[f"{label}BLOCK"]
                                    ^
 1378 |         if numel == 1:
 1379 |             assert block == 1, (

torch/_inductor/runtime/triton_heuristics.py:1381:24: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                               ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:38: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                             ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:46: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                     ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:52: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                           ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:58: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                 ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:64: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                       ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:71: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                              ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:77: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                    ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:84: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                           ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:88: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                               ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1384:52: Builtin `set` is deprecated
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
                                                           ^
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"

torch/_inductor/runtime/triton_heuristics.py:1384:58: Builtin `set` is deprecated
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
                                                                 ^
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"

torch/_inductor/runtime/triton_heuristics.py:1386:45: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                    ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:51: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                          ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:66: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                                         ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:80: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                                                       ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1387:20: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                           ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:26: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                 ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:33: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                        ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:39: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                              ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:45: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                    ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:59: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                  ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:61: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                    ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:71: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                              ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:78: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                                     ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:82: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                                         ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1402:19: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                          ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:23: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                              ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:46: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                     ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:56: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                               ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:67: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                                          ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:71: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                                              ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1551:21: Builtin `set` is deprecated
 1549 |     rnumels = {}
 1550 |     for idx in range(num_reduction_dims - 1, -1, -1):
 1551 |         prefix = f"r{idx}_"
                            ^
 1552 |         max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()])
 1553 |         dim = min(max_size, remaining)

torch/_inductor/runtime/triton_heuristics.py:1551:25: Builtin `set` is deprecated
 1549 |     rnumels = {}
 1550 |     for idx in range(num_reduction_dims - 1, -1, -1):
 1551 |         prefix = f"r{idx}_"
                                ^
 1552 |         max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()])
 1553 |         dim = min(max_size, remaining)

torch/_inductor/runtime/triton_heuristics.py:1556:34: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                         ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:38: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                             ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:67: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                                                          ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:77: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                                                                    ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1564:38: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                             ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:46: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                     ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:57: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                                ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:59: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                                  ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:37: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                            ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:45: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                    ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:49: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                        ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:60: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                                   ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1746:49: Builtin `set` is deprecated
 1744 |
 1745 |     if not configs:
 1746 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                        ^
 1747 |     return cached_autotune(
 1748 |         size_hints,

torch/_inductor/runtime/triton_heuristics.py:1746:60: Builtin `set` is deprecated
 1744 |
 1745 |     if not configs:
 1746 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                                   ^
 1747 |     return cached_autotune(
 1748 |         size_hints,

torch/_inductor/runtime/triton_heuristics.py:1928:32: Builtin `set` is deprecated
 1926 |         for prefix in size_hints:
 1927 |             if prefix_is_reduction(prefix):
 1928 |                 c.kwargs.pop(f"{prefix.upper()}BLOCK")
                                       ^
 1929 |
 1930 |     if disable_pointwise_autotuning(inductor_meta):

torch/_inductor/runtime/triton_heuristics.py:1928:47: Builtin `set` is deprecated
 1926 |         for prefix in size_hints:
 1927 |             if prefix_is_reduction(prefix):
 1928 |                 c.kwargs.pop(f"{prefix.upper()}BLOCK")
                                                      ^
 1929 |
 1930 |     if disable_pointwise_autotuning(inductor_meta):

torch/_inductor/runtime/triton_heuristics.py:1975:49: Builtin `set` is deprecated
 1973 |     assert triton_meta is not None
 1974 |     if len(size_hints) != 2:
 1975 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                        ^
 1976 |
 1977 |     configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta)

torch/_inductor/runtime/triton_heuristics.py:1975:60: Builtin `set` is deprecated
 1973 |     assert triton_meta is not None
 1974 |     if len(size_hints) != 2:
 1975 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                                   ^
 1976 |
 1977 |     configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta)

torch/_inductor/runtime/triton_heuristics.py:2082:56: Builtin `set` is deprecated
 2080 |         xnumel, ynumel, znumel = numels[2], numels[1], numels[0]
 2081 |     else:
 2082 |         raise AssertionError(f"invalid size for numels {len(numels)}")
                                                               ^
 2083 |
 2084 |     def get_grid_dim(numel, block):

torch/_inductor/runtime/triton_heuristics.py:2082:68: Builtin `set` is deprecated
 2080 |         xnumel, ynumel, znumel = numels[2], numels[1], numels[0]
 2081 |     else:
 2082 |         raise AssertionError(f"invalid size for numels {len(numels)}")
                                                                           ^
 2083 |
 2084 |     def get_grid_dim(numel, block):

torch/_inductor/runtime/triton_heuristics.py:2104:57: Builtin `set` is deprecated
 2102 |             torch._check(
 2103 |                 y_grid <= max_y_grid,
 2104 |                 lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue",
                                                                ^
 2105 |             )
 2106 |

torch/_inductor/runtime/triton_heuristics.py:2104:64: Builtin `set` is deprecated
 2102 |             torch._check(
 2103 |                 y_grid <= max_y_grid,
 2104 |                 lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue",
                                                                       ^
 2105 |             )
 2106 |

torch/_inductor/runtime/triton_heuristics.py:2113:43: Builtin `set` is deprecated
 2111 |         )
 2112 |
 2113 |     setattr(grid_fn, "grid_fn_str", f"grid{numels}")  # noqa: B010
                                                  ^
 2114 |
 2115 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2113:50: Builtin `set` is deprecated
 2111 |         )
 2112 |
 2113 |     setattr(grid_fn, "grid_fn_str", f"grid{numels}")  # noqa: B010
                                                         ^
 2114 |
 2115 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2122:48: Builtin `set` is deprecated
 2120 |         return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1)
 2121 |
 2122 |     grid_fn_str = f"cooperative_reduction_grid({xnumel})"
                                                       ^
 2123 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2124 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2122:55: Builtin `set` is deprecated
 2120 |         return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1)
 2121 |
 2122 |     grid_fn_str = f"cooperative_reduction_grid({xnumel})"
                                                              ^
 2123 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2124 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2135:54: Builtin `set` is deprecated
 2133 |     coop_grid = cooperative_reduction_grid(xnumel)
 2134 |     normal_grid = grid(xnumel)
 2135 |     grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})"
                                                             ^
 2136 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2137 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2135:61: Builtin `set` is deprecated
 2133 |     coop_grid = cooperative_reduction_grid(xnumel)
 2134 |     normal_grid = grid(xnumel)
 2135 |     grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})"
                                                                    ^
 2136 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2137 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2145:37: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                            ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:44: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                   ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:47: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                      ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:54: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                             ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2173:42: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                 ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:53: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                            ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:66: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                                         ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:77: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                                                    ^
 2174 |     else:
 2175 |         # sequential dispatch
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143628
Approved by: https://github.com/yanboliang, https://github.com/rec
2024-12-20 11:45:26 +00:00
6733045a4a export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

To reproduce the UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-20 11:42:09 +00:00
b539c61631 [Hierarchical Compile] Update NoneAsConstantBuffer to support graph d… (#143531)
Fixes issues I hit while running graph deduplication with torch tune.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143531
Approved by: https://github.com/eellison
2024-12-20 09:23:12 +00:00
f9f82ca48f [ts converter] use Dim.AUTO for ts -> export converter (#138273)
Switches the TS converter to use `Dim.AUTO` by default, exporting models with maximal dynamism. Also adds runtime input tests to `test_converter.py`.
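
For reference, a hedged sketch of what `Dim.AUTO` means on the `torch.export` side (a toy module, not the converter itself): each marked dimension is inferred as dynamic where possible.

```python
import torch
from torch.export import Dim, export

class MulTwo(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Every dimension of x is marked Dim.AUTO, so export infers the dynamism.
ep = export(MulTwo(), (torch.randn(4, 8),), dynamic_shapes={"x": (Dim.AUTO, Dim.AUTO)})
print(ep)
```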
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138273
Approved by: https://github.com/avikchaudhuri
2024-12-20 07:48:24 +00:00
270ad513c8 [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-20 07:46:49 +00:00
29b586bbad fix formatting in programming model doc (#143587)
Test Plan: Some of the formatting in https://docs-preview.pytorch.org/pytorch/pytorch/143546/export.programming_model.html is broken.

Differential Revision: D67458972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143587
Approved by: https://github.com/yushangdi
2024-12-20 07:09:19 +00:00
fe0f20615c [DynamoBench] Handle accuracy results in benchmark records (#143611)
I discovered this issue when trying to search for the accuracy results in the database and couldn't find any.  It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because the benchmark values are a list of strings here while the expectation is a list of numbers.

ClickHouse doesn't support mixed types atm. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by the CH team themselves.  So, the remaining option is to store this in the `extra_info` field. This field is a dictionary, so it can go there.
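
A hedged sketch of the record shape change being described; the exact placement of `extra_info` and any field names beyond `metric` and `benchmark_values` are assumptions made for illustration.

```python
# Before: a non-numeric value under benchmark_values breaks insertion.
record_before = {
    "metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]},
}

# After: the string result moves into the free-form extra_info dictionary.
record_after = {
    "metric": {
        "name": "accuracy",
        "benchmark_values": [],
        "extra_info": {"accuracy": "pass_due_to_skip"},
    },
}
```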

### Testing

https://github.com/pytorch/pytorch/actions/runs/12421747715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
2024-12-20 06:43:38 +00:00
132fcf4e0d [user triton] Raise an exception when encountering nested @triton.autotune decorators or @triton.heuristics (#143519)
We support running a single Autotuner for each Triton kernel. Currently,
if there are multiple autotuning decorators, the subsequent ones will be
silently ignored.

Instead, we should raise an error here to avoid silent incorrectness.
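
A hedged illustration of the now-rejected pattern (the kernel and configs are made up for the example); under `torch.compile`, stacked autotuning decorators like the ones below should raise instead of silently dropping one autotuner:

```python
import triton
import triton.language as tl

@triton.autotune(configs=[triton.Config({"BLOCK": 128})], key=["n"])
@triton.autotune(configs=[triton.Config({"BLOCK": 64})], key=["n"])  # nested autotuner: now an error
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)
```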

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143519
Approved by: https://github.com/aakhundov
2024-12-20 06:38:45 +00:00
71479a9b9c Revert "[AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352)"
This reverts commit 429f4cd1408b11a7b0dd10634b46b3265dc31af1.

Reverted https://github.com/pytorch/pytorch/pull/143352 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/143352#issuecomment-2556365140))
2024-12-20 06:21:31 +00:00
4e29e4aa63 [BE] Add a test to ensure grads are never inplaced into accidentally (#143612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143612
Approved by: https://github.com/soulitzer
2024-12-20 06:15:08 +00:00
2daa666591 update kineto to XPU Windows fixed PR. [submodule kineto] (#143445)
Includes the XPU Windows fix: https://github.com/pytorch/kineto/pull/1012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143445
Approved by: https://github.com/sraikund16
2024-12-20 05:57:30 +00:00
217a4ddb04 Add range check embedding_bag on input index >= 0 of cuda device (#140791)
Fixes #89362

**Test Result**

**Before**

```
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
tensor([[0., 0., 0.]], device='cuda:0')
```

**After**

```python
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [1,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [2,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 357, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

```

```bash
$ pytest test/nn/test_embedding.py
```
![image](https://github.com/user-attachments/assets/6a5ec759-a3dc-4d51-9e5e-ec79c0aac526)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/2ce4ac24-74fb-4181-9510-18b96a2c2acb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140791
Approved by: https://github.com/eqy
2024-12-20 05:47:26 +00:00
9713a6eeca remove allow-untyped-defs from torch/fx/experimental/refinement_types.py (#143602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143602
Approved by: https://github.com/aorenste
2024-12-20 05:40:52 +00:00
78d294379a remove allow-untyped-defs from torch/_lazy/config.py (#143603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143603
Approved by: https://github.com/aorenste
2024-12-20 05:34:19 +00:00
cb4e9888df remove allow-untyped-defs from torch/ao/quantization/experimental/APoT_tensor.py (#143601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143601
Approved by: https://github.com/aorenste
2024-12-20 05:26:09 +00:00
dd346dbeab remove allow-untyped-defs from torch/distributed/elastic/multiprocessing/errors/handlers.py (#143605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143605
Approved by: https://github.com/aorenste
2024-12-20 05:25:01 +00:00
fd23cf5848 [Dynamo] check node class first for graph dedup (#143609)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143609
Approved by: https://github.com/williamwen42
2024-12-20 04:09:46 +00:00
1c2593f035 [dynamo] guard global autocast state (#143592)
Fixes https://github.com/pytorch/pytorch/issues/112260.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143592
Approved by: https://github.com/jansel
2024-12-20 03:30:54 +00:00
d339f1506b Add cutlass version guard in prep for upgrade (#143551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143551
Approved by: https://github.com/eqy
2024-12-20 02:40:02 +00:00
75661f2036 try root fix for FP8 tensor (#143248)
Fixes #143194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143248
Approved by: https://github.com/fegin
2024-12-20 01:57:17 +00:00
4462cc6375 Revert "[Inductor] inplace padding (#140249)"
This reverts commit 297ce776363cc4802fa74d210fced2b4128960d5.

Reverted https://github.com/pytorch/pytorch/pull/140249 on behalf of https://github.com/huydhn due to This break an internal test https://fburl.com/test/ppl2we5l ([comment](https://github.com/pytorch/pytorch/pull/140249#issuecomment-2556079406))
2024-12-20 01:30:27 +00:00
e1b4635504 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143606
Approved by: https://github.com/aorenste
2024-12-20 01:26:51 +00:00
a0cff096bc Improve cond error messaging (#143595)
Discovered by @drisspg and me while trying out a simple toy example and being way too confused :')

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143595
Approved by: https://github.com/zou3519, https://github.com/ydwu4
2024-12-20 01:19:20 +00:00
d547fae5b0 [Codemod][AddExplicitStrictExportArg] caffe2/torch/onnx/_internal/exporter (#143542)
Reviewed By: avikchaudhuri

Differential Revision: D67381244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143542
Approved by: https://github.com/ydwu4, https://github.com/titaiwangms
2024-12-20 00:54:52 +00:00
544de4008e [Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)
Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression of these two timm_models is caused by the Conv/Linear + broadcast add fusion running into the oneDNN ref path. This PR constrains the shape of the other tensor for the Conv/Linear + broadcast add fusion to fix this issue.
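
A hedged illustration (shapes made up) of the fused pattern in question: a Linear output plus a tensor that broadcasts over it, compiled so the Linear + broadcast add fusion can apply.

```python
import torch

linear = torch.nn.Linear(64, 128)
x = torch.randn(32, 64)
other = torch.randn(1, 128)  # broadcasts over the batch dimension of the Linear output

def f(x, other):
    return linear(x) + other

out = torch.compile(f)(x, other)
```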

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-20 00:35:58 +00:00
8136daff5a Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit 4b82251011f85f9d1395b451d61e976af844d9b1.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))
2024-12-19 23:33:17 +00:00
145fd5bad0 Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847)"
This reverts commit a96387a481633389a6b5a5ac7b8406e9216f320e.

Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/huydhn due to This has been reverted internally D67436053 ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2555942351))
2024-12-19 23:22:44 +00:00
d2b83aa122 add grad_output shape check for fractional_max_pool2d_backward (#141666)
Fix https://github.com/pytorch/pytorch/issues/141102.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141666
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-19 22:47:02 +00:00
2def1f6f74 [caffe2] Move vectorized templates into a separate file for box_cox operator (#143556)
Summary: No functional changes in this diff; the code is moved into a separate file so it can be reused by the AVX-512 version in a follow-up diff.

Test Plan: buck build //caffe2/caffe2/perfkernels:perfkernels

Differential Revision: D67433115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143556
Approved by: https://github.com/hl475
2024-12-19 22:02:23 +00:00
429f4cd140 [AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352)
Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping the AOTI-generated .pt2 package file, users can manually build the generated model code in their local environment.

Differential Revision: [D67458526](https://our.internmc.facebook.com/intern/diff/D67458526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143352
Approved by: https://github.com/malfet
2024-12-19 22:01:05 +00:00
e9bd74d763 Revert "[export] don't decompose custom triton op when exporting (#142426)"
This reverts commit 10b9c5944e8d6ff0685e1ef25277a1d3c4c9c5aa.

Reverted https://github.com/pytorch/pytorch/pull/142426 on behalf of https://github.com/huydhn due to This fails one internal MTIA test, checking with the author that we need to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/142426#issuecomment-2555793496))
2024-12-19 21:21:38 +00:00
fc03c62c56 Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (#142062)
Related: #125914 (specifically see [comment](https://github.com/pytorch/pytorch/issues/125914#issuecomment-2513044125))

This PR addresses two broken things involving the usage of unbacked SymInts for calls to `slice()` with data-dependent bounds. These issues are encountered in practice for `narrow()` operating on the batch dim with an NJT input, but apply to other subclasses as well. The test in this PR uses a purpose-built subclass.

There are two different issues here, depending on whether `torch.compile()` is called with `dynamic=True`. In practice, these only occur when the unbacked SymInts are created within the torch_dispatch implementation of a subclass, because the unbacked symbols are considered "freshly created" when the output subclass instance is handled in Dynamo.

**Error 1 (dynamic=False):**
```
LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(-Min(22, Max(0, u0)) + Min(22, Max(u0 + u1, Max(0, u0))), 0) (unhinted: Eq(-Min(s0, Max(0, u0)) + Min(s0, Max(u0 + u1, Max(0, u0))), 0)).  (Size-like symbols: u1, u0)
```

The expression comes from the use of `clamp()` logic for `SliceView` in Inductor:
41e59754b4/torch/_inductor/ir.py (L3014)

If the (start, end) bounds for the `slice()` are statically known to be in range for the given dim (e.g. provided via `torch._check()` calls), we can avoid this `clamp()` logic and the error. This PR implements this fix.
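
A minimal, hedged sketch (not the PR's test subclass) of how `torch._check()` calls can make data-dependent slice bounds statically known to be in range, so the clamp can be skipped:

```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(backend="eager", fullgraph=True)
def narrow_batch(values, start_t, length_t):
    start = start_t.item()    # unbacked SymInt
    length = length_t.item()  # unbacked SymInt
    # Assert the data-dependent bounds are in range for dim 0.
    torch._check(start >= 0)
    torch._check(start + length <= values.size(0))
    return values.narrow(0, start, length)

x = torch.randn(22, 3)
print(narrow_batch(x, torch.tensor(4), torch.tensor(5)).shape)
```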

**Error 2 (dynamic=True):**
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {u0} not in returned outputs NestedTensor(size=(2, s16, s1), offsets=FakeTensor(..., device='cuda:0', size=(3,), dtype=torch.int64), grad_fn=<NarrowBackwardAutogradNestedTensor0 object at 0x7f1f8603cfd0>, contiguous=True) ((s1*s16, s1, 1), s1*u0)
```

The storage offset of the values component of the returned NJT is `s1*u0` where `s1` is known to be an integer. This PR expands the special logic handling the `constant * u0` case to handle SymInts as well:
314e08eb52/torch/fx/experimental/symbolic_shapes.py (L1013-L1031)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142062
Approved by: https://github.com/ezyang
ghstack dependencies: #143526
2024-12-19 21:08:04 +00:00
0b2c47962c Add support for differentiable LR in SGD + test v2.0 (#143510)
Second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! The first one had an issue near the end, so this is the second PR on the subject. See #143122 for the development up until this point.
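
A hedged sketch of the kind of usage this line of work targets (a tensor learning rate plus `differentiable=True`; whether this exact call works end-to-end depends on the PRs in this series):
```python
import torch

w = torch.randn(3, requires_grad=True)
lr = torch.tensor(0.1, requires_grad=True)  # learning rate we want gradients for

opt = torch.optim.SGD([w], lr=lr, differentiable=True)

loss = (w ** 2).sum()
loss.backward(create_graph=True)  # keep the graph so the update itself stays differentiable
opt.step()

# A downstream ("meta") loss on the updated parameters can now backprop into lr.
meta_loss = (w ** 2).sum()
grad_lr, = torch.autograd.grad(meta_loss, lr, allow_unused=True)
```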
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143510
Approved by: https://github.com/janeyx99
2024-12-19 21:04:44 +00:00
629de4da60 [dynamo] Add a lint rule to restrict what 3P library one can import (#143312)
As title, this patch prevents developers from importing third party
libraries to patch things in Dynamo, unless there's no other easy
workaround (in which case one would add the library to the allowlist in
`import_linter.py`, as instructed by the lint error).

For instance, if we remove `einops` from the allowlist, we'd get this
```verbatim
>>> Lint for torch/_dynamo/decorators.py:

  Error (IMPORT) Disallowed import

    importing from einops is not allowed, if you believe there's a valid
    reason, please add it to import_linter.py

        608  |# Note: this carefully avoids eagerly import einops.
        609  |# TODO: we should delete this whole _allow_in_graph_einops logic by approximately 2024 Q2
        610  |def _allow_in_graph_einops():
    >>> 611  |    import einops
        612  |
        613  |    try:
        614  |        # requires einops > 0.6.1, torch >= 2.0

  Error (IMPORT) Disallowed import

    importing from einops is not allowed, if you believe there's a valid
    reason, please add it to import_linter.py

        612  |
        613  |    try:
        614  |        # requires einops > 0.6.1, torch >= 2.0
    >>> 615  |        from einops._torch_specific import (  # type: ignore[attr-defined]  # noqa: F401
        616  |            _ops_were_registered_in_torchdynamo,
        617  |        )
        618  |
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143312
Approved by: https://github.com/zou3519
2024-12-19 20:59:16 +00:00
8e78345d69 remove allow-untyped-defs from distributed/tensor/experimental/__init__.py (#143583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143583
Approved by: https://github.com/awgu
2024-12-19 20:25:28 +00:00
0a7dba4978 [cond] Change Autograd for cond (#142518)
Instead of returning None for unused variables, a tensor with all-zeros is returned.
Fixes [141301](https://github.com/pytorch/pytorch/issues/141301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142518
Approved by: https://github.com/ydwu4
2024-12-19 20:09:42 +00:00
8850a7b62c add some logging for tensorify (#143391)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143391
Approved by: https://github.com/jamesjwu
2024-12-19 20:06:26 +00:00
25172dc075 remove allow-untyped-defs from torch/ao/quantization/experimental/fake_quantize_function.py (#143582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143582
Approved by: https://github.com/XuehaiPan, https://github.com/laithsakka
2024-12-19 20:06:22 +00:00
2d150ad29f [ROCm] Fix unit test: matmul_offline_mgpu_tunableop (#143507)
Fixes #141652

This PR contains:

- Fix for `matmul_offline_mgpu_tunableop`
- Modifications to _checking_tuning_assertions to enable TunableOp if it is disabled. Also moved it into the concurrent futures initializer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507
Approved by: https://github.com/jeffdaily
2024-12-19 19:48:20 +00:00
66172578f9 [ROCm] Guard triton backend call around cuda.is_available (#143570)
To resolve: https://github.com/pytorch/test-infra/issues/6082

Calling into Triton's get_backend_options will initialise CUDA and break CPU-only environments that may have hip installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143570
Approved by: https://github.com/atalman, https://github.com/jeffdaily
2024-12-19 19:46:13 +00:00
c46cfc245f [Dynamo] Support dict_keys from nested dict object (#143557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143557
Approved by: https://github.com/williamwen42
ghstack dependencies: #143374, #143547
2024-12-19 19:02:55 +00:00
5fa287aa82 [Dynamo] Rename Dict{View/Keys/Values} to Dict{View/Keys/Values}Variable (#143547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143547
Approved by: https://github.com/williamwen42
ghstack dependencies: #143374
2024-12-19 19:02:55 +00:00
4b82251011 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-19 18:51:26 +00:00
c5ddf5dd90 Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (non-dynamic) (#143526)
Lifted non-controversial (non-dynamic) fixes from #142062. See description there for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143526
Approved by: https://github.com/ezyang
2024-12-19 18:46:36 +00:00
2a11472f46 update expected results (#143586)
Update results based on a small regression added by
17b71e5d6a

The maximum regression we saw was 1.25%, for sum_floor_div.
<img width="842" alt="Screenshot 2024-12-19 at 9 04 30 AM" src="https://github.com/user-attachments/assets/6ce913cd-110d-4837-af59-08fb6a0dd12d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143586
Approved by: https://github.com/bobrenjc93
2024-12-19 18:43:27 +00:00
e1e83015d2 [dynamo, 3.13t] raise error if torch.compile is attempted in 3.13t (nogil) (#143404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143404
Approved by: https://github.com/colesbury, https://github.com/atalman
2024-12-19 18:10:01 +00:00
33c27be017 Workaround for gather_out in MPS backend (#135543)
Avoids an underlying issue in reshape op in MPS that gets triggered when the input has multiple dimensions but the shape can be squeezed into 1D. The underlying issue is going to get fixed eventually.

Fixes https://github.com/pytorch/pytorch/issues/135240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135543
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-19 18:01:01 +00:00
1433bad0e4 torch export programming model (#143546)
Differential Revision: [D67429743](https://our.internmc.facebook.com/intern/diff/D67429743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143546
Approved by: https://github.com/ydwu4
2024-12-19 16:56:13 +00:00
61a835ec53 Corrected description of AMSGrad algorithm (#142351)
Fixes #142323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142351
Approved by: https://github.com/janeyx99
2024-12-19 16:24:19 +00:00
171e6a934f Don't 1 specialize if stride is contiguous (#143365)
Fixes: https://github.com/pytorch/pytorch/issues/142024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143365
Approved by: https://github.com/ezyang
2024-12-19 15:22:47 +00:00
465f282a24 [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-19 15:16:10 +00:00
288aa87383 [Inductor][CPU] disable bernoulli_p decomposition (#143460)
Fix https://github.com/pytorch/pytorch/issues/142853
`fallback_random=True` should cause RNG to match between compile/eager (by having compile fall back to eager for RNG ops), but the `bernoulli_p` decompose function is not fully consistent with the eager CPU implementation.
We remove the decomp and keep the version for `fallback_random=False`.
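
A sketch of the behavior being fixed (assuming `fallback_random` is passed through `torch.compile`'s inductor options; the exact reproduction in the linked issue may differ):
```python
import torch

def f(x, p):
    return x.bernoulli_(p)

torch.manual_seed(0)
eager_out = f(torch.empty(1024), 0.3)

torch.manual_seed(0)
compiled_f = torch.compile(f, options={"fallback_random": True})
compiled_out = compiled_f(torch.empty(1024), 0.3)

# With fallback_random=True, compiled RNG ops fall back to eager kernels,
# so these are expected to match bit-for-bit after this fix.
print(torch.equal(eager_out, compiled_out))
```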

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143460
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-19 11:21:35 +00:00
fd8b217fcd Pass allow_rhs_unbacked to the stride test in metadata test too (#143040)
Fixes https://github.com/pytorch/pytorch/issues/142410

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143040
Approved by: https://github.com/bobrenjc93
2024-12-19 09:37:50 +00:00
451c233936 leaking c++ singleton specifically (#143509)
Summary:
fix forward for S477887

leaking c++ singleton specifically

When C++ shuts down, it tries to destruct the singleton and acquire the GIL; at that point the Python runtime has already exited, causing undefined behavior.
We leak the singleton here specifically so that we don't try to destroy it during the shutdown phase.

Test Plan: n/a

Differential Revision: D67400633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143509
Approved by: https://github.com/c-p-i-o
2024-12-19 09:27:07 +00:00
da06d47bdb dynamo tracing perf: slight improvement on __instancecheck__: 47.77 -> 47.62 (#143064)
See #143056 for overall docs.

This PR: Switch out an `isinstance()` for an `is` in the very hot
`VariableTrackerMeta.__instancecheck__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143064
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-12-19 09:19:35 +00:00
a97c6a78a8 Upgrade submodule ideep for bf16f32 matmul changes (#143508)
This change will enable PR #140159 to pick proper kernels in bf16 mode for the SDPA layer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143508
Approved by: https://github.com/yanbing-j, https://github.com/jgong5
2024-12-19 06:49:16 +00:00
2ffdcab04c [Dynamo] Add DictKeySetVariable to capture dict_keys passed outside of compiled region (#143374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143374
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-19 06:39:27 +00:00
fa1a4a91e9 add batch_size check for max_pool2d_backward (#141657)
Fix https://github.com/pytorch/pytorch/issues/140923.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141657
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-19 06:01:41 +00:00
a7ba562ec8 [state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845)
For the distributed state dict API [migration](https://github.com/pytorch/torchtune/pull/2138), make the following changes:
1. `load_from_full_model_state_dict` in TorchTune calls `set_model_state_dict` with an option controlling cpu_offload. Add cpu_offload to _load_model_state_dict to move tensors to CPU when the option is True.
2. Change the device check, since lora_finetune might have 2 device types; accept that as valid.
3. Some changes to optimize memory performance:
3.1 use `.detach().clone()` instead of a view directly
3.2 if local_state is not meta, copy `full_tensor[slices]` into `ret.to_local()`
4. Add related unit tests

Memory performance calling from TorchTune with llama2/7B_full:
1. cpu_offload = True
<img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" />

2. cpu_offload = False
<img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845
Approved by: https://github.com/fegin
2024-12-19 05:06:41 +00:00
e4301aeaa5 [ODML] Make the ML feature provider thread safe (#143418)
Summary:
This PR is generated from a meta internal Diff, aiming to resolve a crash from a race condition on the dictionary.

Test Plan:

Build and run

Print out the count/name/value of the dictionary and verify that values are read, set, and removed correctly.

Observe the print statement on app start within IG

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143418
Approved by: https://github.com/shoumikhin
2024-12-19 04:47:56 +00:00
bf44d5bfb5 [Inductor] move custom pre pass (#143458)
Fixes #143363.

Move `joint_custom_pre` pass after `remove_noop_ops`/`constant_folding`, in order to get the same behavior as `pattern_matcher`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143458
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-12-19 04:41:20 +00:00
deb1da15cc [foreach_map] Add foreach_map Adam impl to compiled optimizer tests (#143454)
Adds a foreach_map backed Adam to compiled optimizer tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143454
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-12-19 03:16:47 +00:00
19d8bbafb2 Update release matrix for 2.6 (#143538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143538
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-12-19 02:02:04 +00:00
14fe1f7190 Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit d3ff2d42c28a2c187cbedfd8f60b84a4dfa2d6bf.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))
2024-12-19 01:05:11 +00:00
2c48af568a [CUDA][64-bit indexing] Fix some existing problematic int64_t _ = blockIdx.* * blockDim.* code (#142010)
`grep` didn't surface any `blockIdx.z * blockDim.z` cases
```
git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" | xargs sed -i 's/int64_t \(.*\) = blockIdx.x \* blockDim.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g'
git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" | xargs sed -i 's/int64_t \(.*\) = threadIdx.x + blockIdx.x \* blockDim.x;.*/int64_t \1 = threadIdx.x + ((int64_t) blockIdx.x) * blockDim.x;/g'
git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" | xargs sed -i 's/int64_t \(.*\) = blockIdx.y \* blockDim.y + threadIdx.y;.*/int64_t \1 = ((int64_t) blockIdx.y) * blockDim.y + threadIdx.y;/g'
git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" | xargs sed -i 's/int64_t \(.*\) = threadIdx.y + blockIdx.y \* blockDim.y;.*/int64_t \1 = threadIdx.y + ((int64_t) blockIdx.y) * blockDim.y;/g'
git grep -l "int64_t.*=.*blockDim.x \* blockIdx.x.*" | xargs sed -i 's/int64_t \(.*\) = blockDim.x \* blockIdx.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g'
```
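
For context, a minimal CUDA sketch of the pattern those substitutions rewrite (illustrative kernel, not from the PR): `blockIdx.x` and `blockDim.x` are 32-bit, so the product must be widened before the multiply to avoid overflow on very large launches.
```cuda
__global__ void copy_kernel(const float* src, float* dst, int64_t numel) {
  // Problematic: the multiply/add happens in 32 bits and is widened only afterwards.
  // int64_t i = blockIdx.x * blockDim.x + threadIdx.x;

  // Fixed: cast first so the arithmetic itself is done in 64 bits.
  int64_t i = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < numel) {
    dst[i] = src[i];
  }
}
```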

See also https://github.com/pytorch/pytorch/pull/141922/files#r1868262823 in #141999 and #141922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142010
Approved by: https://github.com/ngimel
2024-12-19 00:55:11 +00:00
b4e0e3bfa3 Backout D66648013 (#143433)
Summary:
backing out https://www.internalfb.com/diff/D66648013 (see comments there for justification)

I will reland and disallow the bfloat16 atomics behavior on A100 because it causes a pretty significant performance regression.

Test Plan: This is a revert

Differential Revision: D67357485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143433
Approved by: https://github.com/davidberard98
2024-12-19 00:53:49 +00:00
5c3996cab2 [Dynamo] topologically sort duplicated graph regions (#143523)
Ensure regions are topologically sorted

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143523
Approved by: https://github.com/williamwen42
2024-12-19 00:43:48 +00:00
55092e1ec5 [BE] Delete install sccache step from MacBB (#143512)
To the best of my knowledge, this step never executed, and there have been no MacOS binary builds running on trunk for a while.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143512
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #143395, #143511
2024-12-19 00:41:28 +00:00
5e172ea004 [BE] Get rid of malfet/checkout@silent-checkout (#143516)
Instead use `actions/checkout@v4` with `show-progress: false`. It's more verbose than the quiet option, but our logs are long anyway...

Partially addresses https://github.com/pytorch/pytorch/issues/143079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143516
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/huydhn
2024-12-19 00:36:36 +00:00
f9da639950 [codemod] Fix a few unused-variable issues in pytorch (#143517)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143517
Approved by: https://github.com/mhorowitz
2024-12-19 00:18:08 +00:00
b23f11c529 [ONNX] Automatically convert dynamic_axes to dynamic_shapes with torch.export.Dim.AUTO (#143158)
With https://github.com/pytorch/pytorch/pull/133620 introducing Dim.AUTO, we can now automatically convert dynamic_axes to dynamic_shapes without specifying min and max. However, exporting could still crash when the same specs are shared between inputs, and there is no guarantee that the axes will be dynamic (see PR description).

~~Therefore, a~~ follow-up PR should create a post-processing ONNX side pass to ~~enable the missed dynamic axes~~ rename the dynamic shapes (s0,  s1, ...) to dynamic_axes (user setting names).

This PR does:
(1) Apply torch.export.Dim.AUTO to dynamic_axes when dynamic_shapes is not provided.
(2) Convert args/kwargs to tuple inputs, which follows the generated dynamic_shapes format to avoid errors during torch.export.export.
(3) Avoid KeyError in the _rename_dynamic_shapes_with_model_inputs function.
(4) Add a real-world case of an HF model with kv_cache to test the ONNX exporter.
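
A minimal sketch of the equivalence being exploited (the converter internals and exporter plumbing are not shown; only public `torch.export` pieces are used here):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Legacy ONNX-style spec: dynamic batch dimension by name.
dynamic_axes = {"x": {0: "batch"}}

# Equivalent torch.export spec using Dim.AUTO -- no min/max needed.
dynamic_shapes = {"x": {0: torch.export.Dim.AUTO}}

ep = torch.export.export(M(), (torch.randn(2, 3),), dynamic_shapes=dynamic_shapes)
print(ep.graph_signature)
```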
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143158
Approved by: https://github.com/xadupre, https://github.com/shubhambhokare1
2024-12-18 23:49:01 +00:00
15a7a0c37e Remove deprecated branch after capture_pre_autograd_graph fully migrate to training IR (#143228)
Summary:
as title

#buildall

Test Plan: CI

Differential Revision: D67222286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143228
Approved by: https://github.com/andrewor14
2024-12-18 23:30:45 +00:00
58627fb6bf [BE] Integrate 5 line build script into template (#143511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143511
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #143395
2024-12-18 23:27:09 +00:00
4eafbe5288 [Dynamo] Flatten slices during graph deduplication (#143522)
I encountered this issue while debugging torchtune - overall we need to make sure to not miss nodes that are slice arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143522
Approved by: https://github.com/williamwen42
2024-12-18 23:12:34 +00:00
5380407af5 [dynamo] Properly model root frame globals during inlining (#143447)
This patch updates `InliningInstructionTranslator.STORE_GLOBAL` to
properly check whether `self.f_globals` is the same as root frame
`f_globals`. See added comments for why this is important.

Fixes #143425.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143447
Approved by: https://github.com/zou3519
2024-12-18 23:04:02 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
d298bd840f [dynamo] add two-point iter test (#143500)
Implements the last checkbox for https://github.com/pytorch/pytorch/issues/112532.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143500
Approved by: https://github.com/StrongerXi
2024-12-18 22:55:46 +00:00
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
4717cd1ce9 Skip test_conv2d_linear_add_broadcast_shapes_cpu on fbcode (#143530)
Summary: The test is added by D67376995 and it is failing on fbcode

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:mkldnn_pattern_matcher_cpu -- --exact 'caffe2/test/inductor:mkldnn_pattern_matcher_cpu - test_conv2d_linear_add_broadcast_shapes_cpu (caffe2.test.inductor.test_mkldnn_pattern_matcher.TestPatternMatcher)'`

Differential Revision: D67413687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143530
Approved by: https://github.com/jansel
2024-12-18 22:08:08 +00:00
d4ed5941db Fix floating point literals in IRPrinter (#142119)
Fixes #114035
This is a recreation of #140002 with approval from its author. Original description:
>when v is larger than 1e16, the formatting is wrong. For example: if v is 1.2e17, the output is 1.2e17.f, which contains two '.' characters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-18 21:59:48 +00:00
10b9c5944e [export] don't decompose custom triton op when exporting (#142426)
For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.

#### The alternative:
If we decompose the custom op to a functional hop and make it a node in the exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jit-ed python functions and triton dtypes. This is undesirable because:
- it can be tedious to maintain a layer that serializes the jit-ed function (e.g. as a string) and dtypes.
- changes to triton or the serialization logic for triton arguments can be BC breaking
- exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.

#### Future plans:
After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file **on the same machine on which they call export**, which does autotuning, removes the triton dependency, and serves the model with the Cubin. This guarantees that triton changes won't break BC.

In the long term, we may export multiple cubins for the triton op directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142426
Approved by: https://github.com/zou3519
ghstack dependencies: #142425
2024-12-18 21:36:28 +00:00
1e201422ed [export] add is_exporting flag (#142425)
We added an is_exporting flag under torch.compiler.is_exporting. This comes in handy when we want special logic at the user level and system level (e.g. higher up in the stack).

In increasing scope:
- `_is_fx_tracing` is set to True when we are under symbolic_trace or make_fx.
- `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and sets _is_fx_tracing to True.
- `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile.
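
A minimal sketch of how user code might branch on this flag (using `torch.compiler.is_exporting()` as named in this commit):
```python
import torch

def scale(x):
    if torch.compiler.is_exporting():
        # Export-friendly path: avoid in-place mutation in the exported graph.
        return x * 2
    return x.mul_(2)  # eager fast path

class M(torch.nn.Module):
    def forward(self, x):
        return scale(x)

ep = torch.export.export(M(), (torch.randn(4),))
```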

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425
Approved by: https://github.com/avikchaudhuri
2024-12-18 21:36:28 +00:00
894d47b91b [ROCm] Fix unit test: matmul_offline_tunableop (#143322)
Fixes #137936

The PR contains:
* Fix for `matmul_offline_tunableop`
* Clean-up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`)
* Avoid the use of environment variables in `minimum_tuning_iteration_tunableop`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322
Approved by: https://github.com/jeffdaily
2024-12-18 20:14:44 +00:00
cyy
255a977494 [1/N] Avoid const_cast (#143169)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143169
Approved by: https://github.com/albanD
2024-12-18 19:48:01 +00:00
f129bcb5a5 [BE] Refactor argument parsing into its own function (#143395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143395
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
2024-12-18 19:42:49 +00:00
8d4926e30a Fix unused variables in test/torch.py (#143399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143399
Approved by: https://github.com/albanD
2024-12-18 17:57:24 +00:00
863e6e4567 Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#141670)
Fix https://github.com/pytorch/pytorch/issues/141447.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141670
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-18 17:46:26 +00:00
b588a78ca3 add grad_output shape check for adaptive_max_pool2d_backward and adaptive_max_pool3d_backward (#141663)
Fix https://github.com/pytorch/pytorch/issues/141099, https://github.com/pytorch/pytorch/issues/141100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141663
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-18 17:44:27 +00:00
93e8e32708 Remove iOS folder (#143398)
This folder is a tutorial, not packaged in PyTorch, showing an example of how to use the now-deprecated Lite Interpreter

People should be using Executorch instead and there's already good documentation on it all over our tutorials and main homepage

Testing to see what breaks in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143398
Approved by: https://github.com/albanD
2024-12-18 17:25:52 +00:00
ed9931e6ee Add tests for non divisible inputs for flex decoding (#143214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143214
Approved by: https://github.com/drisspg
2024-12-18 16:32:45 +00:00
0e8013fc1c [AOTI] Fix a typo in cpp_builder.py (#143351)
Summary: passthough -> passthrough

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143351
Approved by: https://github.com/yushangdi, https://github.com/chenyang78
ghstack dependencies: #143350
2024-12-18 16:28:37 +00:00
a2092665a9 [AOTI] Refactor path operations in AotCodeCompiler (#143350)
Summary: Use safer pathlib operation instead of direct string manipulation; Update some path naming to make them more meaningful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143350
Approved by: https://github.com/yushangdi, https://github.com/chenyang78
2024-12-18 16:28:37 +00:00
24a18d76c8 [MPS] Use metal shaders for all view ops (#143375)
Before this PR, Metal shaders were used to scatter/gather 1-5 dimensional tensors.
This PR introduces generalized ones that can be used for any dimensionality and, as a result, gets rid of 700+ lines of complex and untested code that might not even work as expected.
The generalized gather shader looks as follows:
```metal
kernel void gather_kernel_n(uint linear_index           [[thread_position_in_grid]],
                            constant void * src_        [[buffer(0)]],
                            device void * dst_          [[buffer(1)]],
                            constant uint32_t * size    [[buffer(2)]],
                            constant uint32_t * stride  [[buffer(3)]],
                            constant uint32_t & numel   [[buffer(4)]],
                            constant int32_t & ndim     [[buffer(5)]]) {{
    if (linear_index >= numel) return;

    constant {0} * src = (constant {0} *)src_;
    device {1} * dst = (device {1} *)dst_;

    uint64_t src_offs = 0;
    auto src_idx = linear_index;
    for(int dim = ndim - 1; dim >= 0; --dim) {{
      src_offs += stride[dim] * (src_idx % size[dim]);
      src_idx /= size[dim];
    }}

    dst[linear_index] = cast<{1}>(src[src_offs]);
}}
```

Which, according to the following benchmark
```python
from timeit import default_timer

import torch
import torch.utils.cpp_extension
from torch.utils.benchmark import Measurement, Timer

t = Timer(
    stmt=f"y.copy_(x);torch.mps.synchronize()",
    setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)",
    language="python", timer=default_timer
)
print(t.blocked_autorange())
```
is almost twice as fast as the previous implementation (i.e. on a MacBook M2 Pro it returns 2.9ms for the MPS version vs 1.5ms for the shader one).

On MacOS Sequoia [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with complex data type, as one can see by running the code below
```swift
import Metal
import MetalPerformanceShadersGraph

func gatherComplexMPS(device: MTLDevice,
                inp_buf: MTLBuffer, idx_buf: MTLBuffer,
                out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
print("Using device \(device.name)")

let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4)
```

Fixes https://github.com/pytorch/pytorch/issues/143140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375
Approved by: https://github.com/albanD
2024-12-18 16:15:46 +00:00
f47aac6bc2 Make Context to be Device-agnostic Step by Step (3/N) (#137578)
Detailed Descriptions:
- Use the unified device-agnostic API to create a new generator for the accelerator.
- Add deprecation info for GeneratorForPrivateuseone

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137578
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-12-18 15:12:19 +00:00
80a42399bb Various fix for memory leak in test autograd and dataloader (#143323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143323
Approved by: https://github.com/andrewkho, https://github.com/soulitzer
ghstack dependencies: #143225
2024-12-18 13:56:59 +00:00
84b91ce4a1 remove allow-untyped-defs for torch/_inductor/test_operators.py (#143436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143436
Approved by: https://github.com/aorenste
2024-12-18 12:54:25 +00:00
d8ea4ce631 [reland] Kill capture_pre_autograd_graph API (#143426)
Summary:
Delete the following API:

- capture_pre_autograd_graph()
- capture_pre_autograd_graph_using_training_ir()
- gm_using_training_ir()

Update XLA pin to include https://github.com/pytorch/xla/pull/8398

There's no more call sites to `capture_pre_autograd_graph`.

Except
1) two test cases in coreml, guarded by version guard, PR to remove: https://github.com/apple/coremltools/pull/2400
2) a few call sites guarded by version guard (< 2.5.0)

Test Plan: CI

Differential Revision: D67354440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143426
Approved by: https://github.com/gmagogsfm
2024-12-18 12:07:09 +00:00
eb67dd3e2d [3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149)
Design Doc: https://fburl.com/gdoc/47zpuweb
Prototyping:  D66469341

In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot.

In next diff, we will integrate the mtia backend with profiler python api

Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149
Approved by: https://github.com/nautsimon
2024-12-18 11:58:23 +00:00
993b2f0ee0 Fix unused variables in test/test_transformers.py (#143407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143407
Approved by: https://github.com/drisspg
2024-12-18 09:59:24 +00:00
8dd380803c remove allow-untyped-defs for torch/_functorch/batch_norm_replacement.py (#143438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143438
Approved by: https://github.com/oulgen
2024-12-18 09:01:06 +00:00
75fe5a3ef7 remove allow-untyped-defs for torch/fx/experimental/debug.py (#143439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143439
Approved by: https://github.com/oulgen
2024-12-18 08:55:46 +00:00
03991798ca remove allow-untyped-defs for torch/nn/parallel/__init__.py (#143437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143437
Approved by: https://github.com/oulgen
2024-12-18 08:50:37 +00:00
a99536480d [ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to return NaN early if it is known that the result at this value will be NaN.

According to my short test
```Python
import torch
device = "cuda"
dtype = torch.float32

x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)

for n in range(1024):
    if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
        break

for n in range(1024):
    if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
        break
```

The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```

Surely, it makes sense to increase the limit as a safety margin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet, https://github.com/eqy
2024-12-18 08:30:08 +00:00
2ea4b56ec8 Record min/max of integral tensor in ET (#143088)
Summary:
In et-replay, random data is used to run the operators. However, this does not work well for ops that use indices to access tensors, for example embedding ops, which use indices to look up the embedding table. If random data is used for these index ops, et-replay usually runs into invalid memory access issues.

To fix this, ET provides an environment variable, "ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE": if it is set, ET will capture the min/max values of each flattened integral tensor. Then, in et_replay, the min/max is used to generate random tensors within that range, which fixes the invalid memory access issue.
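
An illustrative sketch of how a recorded [min, max] range could be used when regenerating inputs at replay time (the helper name is hypothetical, not et_replay's actual code):
```python
import torch

def make_index_tensor(shape, recorded_min, recorded_max, device="cpu"):
    # Sample inside the recorded range so lookups (e.g. embedding indices)
    # stay within the table bounds instead of reading invalid memory.
    return torch.randint(recorded_min, recorded_max + 1, shape, device=device)

table = torch.nn.Embedding(1000, 64)
indices = make_index_tensor((8, 16), recorded_min=0, recorded_max=999)
out = table(indices)
```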

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_range_cuda

Differential Revision: D66666931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143088
Approved by: https://github.com/sanrise
2024-12-18 08:20:35 +00:00
bceedeec2b fix checking non-trivial input constraints (#143442)
A bunch of auto dynamic shape tests would fail non-strict retraceability because when checking input constraints, we'd compare non-trivial expressions, which would require / affect shape env.
```
... is not tracked with proxy for <torch.fx.experimental.proxy_tensor._ModuleStackTracer object ...
```

I've also observed this bug internally.

This PR does an early check on whether args passed have concrete shapes, and only then proceeds: as before, we
1. try to unify / solve with the arg dim when the corresponding placeholder node dim is symbolic in one symbol
2. check directly if the placeholder node dim is concrete
3. otherwise defer to run time.

Differential Revision: [D67359596](https://our.internmc.facebook.com/intern/diff/D67359596/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143442
Approved by: https://github.com/tugsbayasgalan
2024-12-18 07:29:08 +00:00
90cc43f270 Support garbage collection after pt2 compilation (#143364)
Summary:
Support garbage collection after pt2 compilation.
Add jk to control the global rollout / rollback of this functionality
Add env var to control individual job's rollout

Test Plan:
Test the model training job with / without these changes

Reviewers:
@yuxihu @ezyang , @Yuzhen11 ,

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364
Approved by: https://github.com/ezyang
2024-12-18 07:25:11 +00:00
9275091d6e [provenance_tracking] Dump inductor_triton_kernel_to_post_grad_nodes.json info in debug_trace (#143055)
Summary:
This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code:
`{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.`

Example paste: P1695235000 verified on the test model.  See "Test Plan":

We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool:
https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab)

https://pxl.cl/66BzK

Note: Currently only supports mapping for inductor's`TritonKernel` type. TODO for enhancing more support for `ExternKernel` and other inductor generated kernel type, etc.

Test Plan:
test_model_coverage.sh:
```
#!/bin/sh
MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge

# buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark

TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 | tee output.txt
```
 {F1973765026}

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)'
```

```
TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor
```

Differential Revision: D66967510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055
Approved by: https://github.com/chenyang78
2024-12-18 06:51:50 +00:00
6829897682 Remove assert from partitioner.py (#143376)
Remove erroneous assert assuming a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert.

Tested locally with a failing ExecuTorch Arm test using
```
$ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376
Approved by: https://github.com/tarun292
2024-12-18 06:08:19 +00:00
6715a8858a Triton bump for 3.2 cherry-picks (device context) (#143409)
Summary:
* https://github.com/triton-lang/triton/pull/3731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143409
Approved by: https://github.com/atalman
2024-12-18 05:17:29 +00:00
c17a07ade3 Add float8 support in serde schema (#143343)
Summary:
Fix https://github.com/pytorch/pytorch/issues/141316

Bump up schema minor version.

as title, add float8 support in serde schema

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_serialize_float8
```

Differential Revision: D67307670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143343
Approved by: https://github.com/yiming0416
2024-12-18 05:07:21 +00:00
576789197a Add support for CPU scalar in addcmul (#143264)
Step required for performance in #143122

Adds support for CPU scalar for tensor_2 in addcmul. For example:
```
import torch
a = torch.rand(2, 2, device="cuda")
b = torch.tensor(1e-3)

torch.add(a, b)
torch.addcmul(a, a, b)  # used to fail, now works
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143264
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-12-18 04:43:29 +00:00
859be14c4e fix a few int64_t index computations, fix complex128 scan that had too few threads (#143401)
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143401
Approved by: https://github.com/eqy
2024-12-18 04:27:27 +00:00
c947a7d38e Fix unused Python variables in test/nn (#143396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143396
Approved by: https://github.com/mikaylagawarecki
2024-12-18 03:30:54 +00:00
17a6d4b882 remove allow-untyped-defs for torch/_export/passes/remove_runtime_assertions.py (#143435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143435
Approved by: https://github.com/oulgen
2024-12-18 03:05:20 +00:00
a9de6a68f4 [CD] Test that all PyTorch wheels support OpenMP (#143394)
Together with https://github.com/pytorch/pytorch/pull/143393 fixes https://github.com/pytorch/pytorch/issues/123225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143394
Approved by: https://github.com/atalman
ghstack dependencies: #143393
2024-12-18 02:27:55 +00:00
2400db115c Use Manylinux 2.28 for nightly build and cxx11-abi (#143423)
As per: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581

Linux Builds: CPU, CUDA 11.8, CUDA 12.4 switched to Manylinux 2.28 and D_GLIBCXX_USE_CXX11_ABI=1 on the week of Dec 16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143423
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-12-18 02:02:58 +00:00
e890d67543 Use process pool for precompilation of triton templates (#142450)
Perf results: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2003%20Dec%202024%2022%3A57%3A51%20GMT&stopTime=Tue%2C%2010%20Dec%202024%2022%3A57%3A51%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/740/head&lCommit=b925256c29ec43e1933e4ede94b16d1f404b595f&rBranch=gh/eellison/740/base&rCommit=a161d6362f7d9db773322d2ce2a3a70aabbecf4b

Training:
<img width="793" alt="image" src="https://github.com/user-attachments/assets/75f5bc0d-8005-4213-ae88-0b94fb187dfc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142450
Approved by: https://github.com/jansel
2024-12-18 01:48:04 +00:00
c06b5048ba [Inductor] Fix _can_be_inplace function (#143279)
Summary:
Modify _can_be_inplace function: return False if `_other.data` is an instance of `ir.BaseView`.

Fix https://github.com/pytorch/pytorch/issues/143280.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143279
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-12-18 00:26:05 +00:00
6cd96f069b Add warning to torch.jit.load (#143403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143403
Approved by: https://github.com/albanD
ghstack dependencies: #143326
2024-12-18 00:17:41 +00:00
ac8342f881 Prevent torch.jit.load path in torch.load when weights_only=True (#143326)
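A hedged illustration of the behavior change (the exact error type/message is an assumption): a TorchScript archive should now be rejected rather than silently routed to the `torch.jit.load` path when `weights_only=True`.
```python
import torch

scripted = torch.jit.script(torch.nn.Linear(2, 2))
scripted.save("scripted.pt")

try:
    torch.load("scripted.pt", weights_only=True)
except Exception as e:
    print("rejected under weights_only=True:", e)

# TorchScript archives should still be loaded explicitly via torch.jit.load.
m = torch.jit.load("scripted.pt")
```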
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143326
Approved by: https://github.com/albanD
2024-12-18 00:17:41 +00:00
13a5c15ef5 Fix sample inputs leaked from subtest (#143415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143415
Approved by: https://github.com/jbschlosser
ghstack dependencies: #143333
2024-12-18 00:15:18 +00:00
3f99682fbd NJT linear_backward should not return inner tensor as-is (#143333)
Fixes debug=1 use-count checks https://github.com/pytorch/pytorch/actions/runs/12187808902/job/34002323481#step:22:2521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143333
Approved by: https://github.com/jbschlosser
2024-12-18 00:15:18 +00:00
feb4818bc9 [SJD] adding kill logic for current process when killing a worker (#141060)
Summary:
We have seen cases where some workers don't receive stop signals, meaning the watchdog isn't stopped accordingly. This diff introduces logic to kill the current pid alongside the worker pid.

Something to note: there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason, which will result in the current pid also not being killed. This seems okay, since the watchdog loop will just attempt to kill the worker pid on the next iteration, but it is worth pointing out.

Test Plan: experiment in next diff shows this works

Differential Revision: D65837085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
2024-12-18 00:13:02 +00:00
efe21ee59d [MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)
Summary: This diff implements the "max_memory_allocated" PyTorch API for MTIA devices, which returns the peak device DRAM usage

Test Plan:
Passed the local unit test
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```

https://www.internalfb.com/intern/testinfra/testrun/8444249544807192

Reviewed By: yuhc, egienvalue

Differential Revision: D67118173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143347
Approved by: https://github.com/nautsimon
2024-12-17 23:37:03 +00:00
a040006da7 Force symlink creation when building python on s390x (#143195)
Sometimes it exists already when building on s390x

This change should fix docker image build on s390x.
Example of error can be found here:
https://github.com/pytorch/pytorch/actions/runs/12282230596/job/34365267303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143195
Approved by: https://github.com/ezyang
2024-12-17 23:01:47 +00:00
2642bbc6dc [CD] Run smoke tests on MacOS wheel (#143393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143393
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-12-17 22:47:07 +00:00
b247f87845 tools: Add a tool to build wheels for multiple python versions (#143361)
Adds a tool to build bdist_wheels sequentially for multiple different
python versions (if specified).

The goal of this tool is to eventually be able to utilize this in our
binary build runs to significantly reduce the amount of time we take to
build packages by utilizing a local ccache from the first build.

Tested locally using the following:
```
$ ccache -C # clear cache
# -p could actually reference any python interpreter
$ python tools/packaging/build_wheel.py \
	-p /home/eliuriegas/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/bin/python3.12 \
	-p /home/eliuriegas/.local/share/uv/python/cpython-3.13.0-linux-x86_64-gnu/bin/python3.13 \
	-d dist-multi/
...
2024-12-17 10:48:11,365 - INFO - Build time (3.12.7): 571.440689s
2024-12-17 10:48:11,365 - INFO - Build time (3.13.0): 191.147503s
```

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143361
Approved by: https://github.com/malfet, https://github.com/atalman
2024-12-17 21:56:06 +00:00
1e058a8f38 FileTimerClient: add retry logic on connect (#143318)
Fixes #143188

The fifo server binds from a thread -- in rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. It will wait up to 1s for the server to start, which balances fast error messages with some wiggle room on the server side.
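
A minimal sketch of the retry idea (names, the error code for the no-reader case, and timings are illustrative; this is not the actual FileTimerClient code):
```python
import errno
import os
import time

def open_fifo_with_retry(path, timeout_s=1.0, interval_s=0.01):
    """Open a FIFO for non-blocking writes, retrying until a reader (the server) appears."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as e:
            # ENXIO: no process has the FIFO open for reading yet (server thread not started).
            if e.errno != errno.ENXIO or time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)
```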

Test plan:

```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
2024-12-17 21:48:30 +00:00
aabe285aaf Add 2 more APIs to the exposed public torch python APIs (#143380)
These two APIs are being used internally for some projects and need to be exposed, since the build for them is done using the OSS toolchain.

af8789c056 - this change hid most APIs in torch python, barring the ones explicitly specified, which broke the build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143380
Approved by: https://github.com/suo
2024-12-17 21:16:51 +00:00
0bdc173ab6 [fr] recognize all_reduce_barrier as a valid op (#143354)
Summary:
D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops.

Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu
```
fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
```

Test Plan: Test manually.

Differential Revision: D67305997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354
Approved by: https://github.com/wconstab
2024-12-17 21:09:18 +00:00
a96387a481 [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-17 20:50:25 +00:00
9283c40ba8 [codemod] Decorate unused variables with [[maybe_unused]] (#143381)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143381
Approved by: https://github.com/malfet
2024-12-17 20:36:03 +00:00
7c25a55c65 clean up type nits on torch/jit/_ir_utils.py (#143371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143371
Approved by: https://github.com/laithsakka
2024-12-17 20:28:07 +00:00
de4a555c82 Run inductor-rocm workflow on ciflow/inductor (#143205)
The paths are almost the same as ciflow/inductor. The only differences I could spot were that ciflow/inductor also has `test/dynamo/**` and `torch/csrc/dynamo/**`

This is to prevent failures like https://github.com/pytorch/pytorch/actions/runs/12304985383/job/34345585535 which fails due to running on a fork, which cannot set the id token.

The other option to prevent this is to stop the job from running when on a fork.

If someone adds both labels, one will be cancelled because they have the same concurrency group

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143205
Approved by: https://github.com/huydhn
2024-12-17 20:09:48 +00:00
b16f020edd Add flex attention kernel parameter tuning options (#139639)
1. Add `num_warps` and `num_stages` to kernel parameters of `flex_attention`. This allows performance tuning when the default parameters of `flex_attention` are suboptimal, for example for `document_masks` (see the sketch after this list).
2. Update how flex decoding splits are assigned to threadblocks. The first split of full blocks are assigned to the first threadblock, and the first split of partial blocks are assigned to the last threadblock.
3. Update `get_split_k` to assign 2 splits per SM before we have runtime workload balancing based on BlockMask.
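A hedged usage sketch follows; it assumes the new knobs are passed through `flex_attention`'s `kernel_options` dict, and the exact key names (`num_warps`, `num_stages`) are an assumption based on this description rather than a documented API.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

compiled = torch.compile(flex_attention)
q = k = v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
out = compiled(
    q, k, v,
    # Assumed key names for the new tuning knobs described above.
    kernel_options={"num_warps": 8, "num_stages": 3},
)
```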

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139639
Approved by: https://github.com/drisspg
2024-12-17 19:31:40 +00:00
e3c53fb1bc Increase sharding for debug build (#143327)
It started timing out consistently and takes 3+ hours per shard

I assume it's just that we slowly add tests over time, since I cannot find a dramatic jump recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143327
Approved by: https://github.com/wdvr, https://github.com/huydhn
2024-12-17 19:27:51 +00:00
5b5d7016c8 Remove stable_partition for ARM AOTI Runtimes (#142394)
Summary: This function call will cause OOM issues on ARM machines with multi-threaded predictors (the reason behind this is still being investigated), so we replace it with the standard partition instead.

Differential Revision: D66904296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142394
Approved by: https://github.com/frank-wei
2024-12-17 19:19:04 +00:00
e7704f41ca Simplify _compute_symbolic_stride() (#138844)
Rewrite _compute_symbolic_stride() to make it simpler and faster.

The existing code involves several inner loops in an attempt to process the common case faster - but in reality this effort is actually slower than the simpler code.

Testing:
The initial version of this PR (which passed all tests) ran both the old algorithm and new algorithm and compared the results to make sure that results were substantially the same (they weren't the same simply because the algorithm allocates new dynamic symbols as part of it).

I also measured the timing of both methods; in the cases I checked (which usually hit the "fast path" of the old algorithm), the simpler algorithm was generally about 30% faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138844
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138843
2024-12-17 19:16:53 +00:00
63cb5e4ade Move inner loop of _create_symbolic_sizes_strides_storage_offset into its own method (#138843)
Making the next PR easier to review:
- move the inner loop of  _create_symbolic_sizes_strides_storage_offset() into a separate function
- fix lintrunner lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138843
Approved by: https://github.com/ezyang
2024-12-17 19:16:53 +00:00
f3ec59d44c Fix non-dense inductor effn attn bias (#141905)
Didn't have any luck making a local repro, partially because of https://github.com/pytorch/pytorch/issues/141888, which will be fixed when we update to Triton 3.2. But I verified locally that this fixes https://github.com/pytorch/pytorch/issues/139424 with the Triton pin update that is landing soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141905
Approved by: https://github.com/drisspg
ghstack dependencies: #143315
2024-12-17 18:55:50 +00:00
1e9ec51431 Fix unused variables in test_serialize_sym_float (#143389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143389
Approved by: https://github.com/Skylion007
2024-12-17 18:55:14 +00:00
18261e9f39 [dynamo] implement framelocals mapping as c++ object (#140063)
Implements https://github.com/pytorch/pytorch/issues/93753 - move frame local guard accessors to C++.

Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible.

Some implementation notes:
- `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells.
- `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily.
- We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed.
- We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up)
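A conceptual Python sketch of the lazy-conversion behavior described in these notes; the real `FrameLocalsMapping` is a C++ object, so the class below is purely illustrative.

```python
class FrameLocalsMapping:
    def __init__(self, fastlocals: list, names: list):
        self._values = fastlocals   # dense vector of frame locals (cells already unboxed)
        self._names = names
        self._dict = None           # materialized only if someone actually needs a dict

    def __getitem__(self, name):
        # Fast path: index by position, no dict construction needed.
        return self._values[self._names.index(name)]

    def to_dict(self) -> dict:
        if self._dict is None:      # lazy conversion, done at most once
            self._dict = dict(zip(self._names, self._values))
        return self._dict
```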

dynamo_guard_eval.py microbenchmark update:
- 713.2us -> 630.0us (3.10)
- 598.8us -> 530.7us (3.12)

Other followups:
- Add `FrameLocalsMapping` version for `check_verbose_nopybind` in order to match behavior between `check_nopybind` and `check_verbose_nopybind`. This can prevent difficult debugging situations where guards fail (`check_nopybind` returns false) but no guard error message is generated (`check_verbose_nopybind` succeeds).
- Rewrite the `SHAPE_ENV` guard into C++ - it is a fairly common guard that results in `FrameLocalsMapping` needing to convert to a dict

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140063
Approved by: https://github.com/jansel
ghstack dependencies: #142117, #142430
2024-12-17 18:54:27 +00:00
c04f0bb7b9 [dynamo] add benchmark for guard eval (#142430)
Benchmarks:
- 713.2us (3.10)
- 598.8us (3.12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142430
Approved by: https://github.com/jansel
ghstack dependencies: #142117
2024-12-17 18:54:27 +00:00
97ca09f692 [dynamo] format eval_frame.c (#142117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142117
Approved by: https://github.com/jansel
2024-12-17 18:54:27 +00:00
53e4d7b6a2 remove allow-untyped-defs for torch/_lazy/device_context.py (#143367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143367
Approved by: https://github.com/aorenste
ghstack dependencies: #143366
2024-12-17 18:54:03 +00:00
bcc93a1e8e remove nonowninglayout special case in require strides (#143315)
NonOwningLayout is always constructed to a FixedLayout. We should handle it the same way as FixedLayout. Note - this case is very rare, I added an assertion here and no test/model failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143315
Approved by: https://github.com/zou3519
2024-12-17 18:47:38 +00:00
a3688ead4b [AOTI][doc] Update tutorial (#143390)
Summary: Update the cpp inference part to call AOTIModelPackageLoader.run directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143390
Approved by: https://github.com/yushangdi
2024-12-17 18:35:40 +00:00
fa4db62968 [CI] Unify the XPU Windows CICD installation scripts (#143185)
Follow https://github.com/pytorch/pytorch/pull/142156
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143185
Approved by: https://github.com/atalman
2024-12-17 18:26:19 +00:00
74e66a21b4 remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143369
Approved by: https://github.com/aorenste
2024-12-17 18:09:28 +00:00
37a1b9efcc [export] Serialize all dataclass fields (#142286)
Reverts a change in #121337. All dataclass members must be serialized, even default-valued members, because downstream code often implicitly assumes their presence.

This PR fixes a segfault when running `test_custom_op_all_inputs` from `test/inductor/test_aot_inductor_custom_ops.py`. This segfault was caused by querying for an "index" field for the `Device` type (see `torch/csrc/inductor/aoti_torch/oss_proxy_executor.cpp:136`), which was previously skipped when serializing if the device index was unspecified. A number of other structs which are deserialized in this file also contain optional fields, and presumably could experience the same bug.
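A small illustrative example of why skipping default-valued fields is risky; the `Device` dataclass below is a stand-in, not the actual export schema.

```python
import dataclasses
import json
from typing import Optional

@dataclasses.dataclass
class Device:
    type: str
    index: Optional[int] = None

def serialize(dev: Device) -> str:
    # asdict() keeps every field, including defaults such as index=None, so a
    # downstream reader that unconditionally queries "index" always finds it.
    return json.dumps(dataclasses.asdict(dev))

print(serialize(Device(type="cuda")))  # {"type": "cuda", "index": null}
```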

Fixes #138955

Fixes #134793
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142286
Approved by: https://github.com/zhxchen17
ghstack dependencies: #142175
2024-12-17 17:21:27 +00:00
bb06fc79fb cpp_builder: handle CUDA lib paths involving "stubs" in more circumstances (#142175)
conda packages for `cuda-driver-dev=12.4.127` use a "stubs" subdirectory to contain `libcuda.so`.  This was previously only handled by cpp_builder in some cases, but now needs to be potentially handled more generally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142175
Approved by: https://github.com/desertfire
2024-12-17 17:21:27 +00:00
e3d754419f Revert "[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)"
This reverts commit 1bf983077f9f9c19e20dac178aa764b4620d78e7.

Reverted https://github.com/pytorch/pytorch/pull/141085 on behalf of https://github.com/huydhn due to The diff D66211131 has been commandeered internally and is it not part of the train anymore.  If codev is needed, pls reland this accordingly ([comment](https://github.com/pytorch/pytorch/pull/141085#issuecomment-2549092225))
2024-12-17 17:21:14 +00:00
ec02ae4345 remove allow-untyped-defs for torch/utils/benchmark/examples/simple_timeit.py (#143368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143368
Approved by: https://github.com/aorenste
2024-12-17 17:19:11 +00:00
313b9964ae remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143370
Approved by: https://github.com/aorenste, https://github.com/desertfire
ghstack dependencies: #143366
2024-12-17 17:18:10 +00:00
487343346e Prevent users from seeing hardcoded print stmt when hypothesis is not installed (#142398)
Fixes: #142357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142398
Approved by: https://github.com/zou3519
2024-12-17 16:59:05 +00:00
969b07b96f Revert "[ROCm] CK Flash Attention Backend (#138947)"
This reverts commit 500d02921bcf1619e268196866ddf099a4b94080.

Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))
2024-12-17 16:46:57 +00:00
cd7de1f4fa remove allow-untyped-defs for torch/masked/maskedtensor/creation.py (#143321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143321
Approved by: https://github.com/laithsakka
2024-12-17 16:44:50 +00:00
4d90c487d8 [AOTI] Add is_big_gpu checking to test_conv3d (#143339)
Summary: test_conv3d tests max-autotune, which is only supported for big_gpu.

Differential Revision: D67306331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143339
Approved by: https://github.com/BoyuanFeng
2024-12-17 16:18:45 +00:00
792f1c47e9 No actual change, just remove variable contain Tensors from global scope (#143225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143225
Approved by: https://github.com/ezyang
2024-12-17 16:14:25 +00:00
afa313e669 Extend bmm tiling to work up to 2^32 elem in any single output dim (#143095)
The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250).

Fixes #141909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095
Approved by: https://github.com/kulinseth
2024-12-17 16:03:46 +00:00
340f02c49b make it clearer (in docs) one can double decorate with torch.library.impl_* APIs (#137608)
Fixes #120503. The fix was originally attempted by @soxand16 in PR https://github.com/pytorch/pytorch/pull/121469. That PR was almost ready to merge but then went stale (over 6 months old). This PR implements the original fix with refactoring for clarity.
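A hedged sketch of the double-decoration pattern the docs now call out, using `torch.library.impl`; the operator name and schema below are made up for illustration.

```python
import torch

torch.library.define("mylib::add_one", "(Tensor x) -> Tensor")

# Stacking two impl decorators registers the same Python function for both
# dispatch keys, which is the pattern the updated docs describe.
@torch.library.impl("mylib::add_one", "CPU")
@torch.library.impl("mylib::add_one", "CUDA")
def add_one_impl(x):
    return x + 1

print(torch.ops.mylib.add_one(torch.zeros(3)))  # tensor([1., 1., 1.])
```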

CC: @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137608
Approved by: https://github.com/zou3519
2024-12-17 15:13:58 +00:00
6bbbb08458 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [10/N] (#142451)
> This is the last one

related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933
- #142451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142451
Approved by: https://github.com/bdhirsh
2024-12-17 12:18:29 +00:00
34a0d8b62e [inductor] invalidate pointwise dep cache for LOAF (#141160)
Fixes https://github.com/pytorch/pytorch/issues/141134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141160
Approved by: https://github.com/vkuzo
2024-12-17 09:51:29 +00:00
5160a725c8 [FlexAttention] Fix broken eager tracing (#143344)
Fixes #143331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143344
Approved by: https://github.com/Chillee
ghstack dependencies: #143299
2024-12-17 09:42:36 +00:00
cf46eb3bf5 [inductor] Include types and size hints in MultiKernel cache key (#142349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142349
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-12-17 09:26:38 +00:00
e2d47a133b Disable c10::optional macros (#138912)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-17 09:22:47 +00:00
c3f3a6e4d2 Back out "Fix undesired specialization on slice after split. (#142372)" (#143356)
Summary:
Original commit changeset: e54ffcc9fd48

Original Phabricator Diff: D67113058

Reviewed By: ezyang

Differential Revision: D67311579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143356
Approved by: https://github.com/oulgen
2024-12-17 09:17:18 +00:00
2531543c5f [user triton cache] Dedup user-defined Triton kernels by config in codecache (#143353)
Previously, the same kernel source with different autotuning configs would generate the same cache key, which can lead to wrong cache hits and silent incorrectness. Here we add the configs to the cache key in `FxGraphHashDetails`.
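A hedged sketch of the general idea; the function and argument names below are illustrative and do not reflect the actual `FxGraphHashDetails` fields.

```python
import hashlib
import json

def cache_key(kernel_source: str, configs: list) -> str:
    payload = {
        "source": kernel_source,
        # Sort for a deterministic key regardless of config ordering.
        "configs": sorted(json.dumps(c, sort_keys=True) for c in configs),
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# Same source, different autotuning configs -> different keys, so no false cache hit.
k1 = cache_key("kernel_src", [{"num_warps": 4}])
k2 = cache_key("kernel_src", [{"num_warps": 8}])
assert k1 != k2
```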

Test Plan:

```
python3 test/inductor/test_codecache.py -k test_triton_higher_order_op_different_configs
...
----------------------------------------------------------------------
Ran 2 tests in 3.590s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143353
Approved by: https://github.com/oulgen
2024-12-17 08:41:22 +00:00
6056efc5ff non strict sequential slicing (#143298)
Differential Revision: [D67284841](https://our.internmc.facebook.com/intern/diff/D67284841/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143298
Approved by: https://github.com/zhxchen17
2024-12-17 08:35:20 +00:00
297ce77636 [Inductor] inplace padding (#140249)
https://github.com/pytorch/pytorch/issues/139865

This PR may change the semantics of constant_pad_nd from 'clone' to 'view'. I tried a few tests that do in-place updates, and thanks to functionalization this works fine.

Perf for `test_linear_and_cel`:
```
# TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=False ms=83.311

# TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=True ms=79.827
```

The saving is about 4ms (slightly less since we need to fill 0 for the padding area). Similar savings for llm.c.
- Without the feature: 182.151ms per batch, 180.9K tokens/s
- With the feature: 178.278ms per batch, 183.9K tokens/s. That is a 3K tokens/s increase.

Perf test shows a compilation time regression. I'm not sure if that's real; will debug more. But a good thing is that there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7).

UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard is not that reliable recently due to AWS migration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249
Approved by: https://github.com/jansel
2024-12-17 06:15:48 +00:00
a42ca5a45b remove allow-untyped-defs for _inductor/codegen/rocm/rocm_template_buffer.py (#143272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143272
Approved by: https://github.com/aorenste
2024-12-17 05:34:22 +00:00
d2ec7f0756 [FlexAttention] Allow num_warps 8 since when block size >=128 (#143299)
# Summary
Fixes #143290

We already strip bad configs here: e0e763e331/torch/_inductor/kernel/flex_attention.py (L2299)
So this shouldn't be needed. Confirming that the 64 x 128 case is valid; otherwise we can just change the default config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143299
Approved by: https://github.com/yanboliang
2024-12-17 05:32:41 +00:00
e7ec92331e remove allow-untyped-defs for torch/jit/_ir_utils.py (#143366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143366
Approved by: https://github.com/aorenste
2024-12-17 05:15:15 +00:00
bcd3692132 [Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142474)
Summary:
**Re-land the pr**. The previous one was reverted because of a test failure on SM89. The fix is just removing `xfailIfSM89`.

```
_____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________
Unexpected success
```
------
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)

-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`

----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.

Test Plan:
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
-----
Ran a float8 dynamic scaling training script to verify it e2e

Differential Revision: D67012816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142474
Approved by: https://github.com/eellison, https://github.com/sijiac, https://github.com/shunting314
2024-12-17 04:14:28 +00:00
500d02921b [ROCm] CK Flash Attention Backend (#138947)
Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics.
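A short usage sketch based on the API named above; it assumes a ROCm build compiled with `USE_CK_FLASH_ATTENTION=1`.

```python
import torch
import torch.nn.functional as F

# Force the composable_kernel backend; "aotriton" or "default" restores the incumbent path.
torch.backends.cuda.preferred_rocm_fa_library("ck")

q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v)  # takes the CK flash-attention path when eligible
```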

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of the hard work of @tridao, who is the co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian

Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
2024-12-17 02:18:07 +00:00
c15638d803 Enable swap on all Linux jobs (#143316)
A swapfile on Linux runner has been prepared by https://github.com/pytorch/test-infra/pull/6058.  So this PR does 2 things:

* Start using the swapfile on all Linux build and test jobs
* Testing the rollout https://github.com/pytorch-labs/pytorch-gha-infra/pull/582

### Testing

Run `swapon` inside the container and the swapfile shows up correctly:

```
jenkins@259dfb0a314c:~/workspace$ swapon
NAME      TYPE SIZE USED PRIO
/swapfile file   3G 256K   -2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143316
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2024-12-17 02:12:24 +00:00
cb4c614ed6 [foreach-map] Add tests for backward (#143282)
Adds tests for unary and binary foreach_map w/ backwards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143282
Approved by: https://github.com/eellison
2024-12-17 02:08:12 +00:00
533d63f83b Revert "FileTimerClient: add retry logic on connect (#143318)"
This reverts commit b3fb8f8a3a2fe07ca61852b09271382c988629fc.

Reverted https://github.com/pytorch/pytorch/pull/143318 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143318#issuecomment-2547342910))
2024-12-17 02:06:52 +00:00
cyy
201cb8834f Enable more C++ warnings (#143099)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143099
Approved by: https://github.com/albanD
2024-12-17 02:03:39 +00:00
af190479c8 [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160)
## Benchmark
M=2048, N=3584, K=8192

baseline (nccl + cublas): 301us
decomp-based async-tp: 354us
comm-aware async-tp: 295us
**multimem_all_gather matmul: 277us**

As M further decreases, the multimem_all_gather approach consistently outperforms the baseline and other approaches (omitted other approaches in the chart as they start to be slower than the baseline):
![image](https://github.com/user-attachments/assets/5811455a-68c9-43fe-9d82-ca488dd77bc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143160
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810, #143159
2024-12-17 01:07:27 +00:00
286921b39e [fused_all_gather_matmul] introduce an argument to specify whether the all-gather result needs to be returned (#143159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143159
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810
2024-12-17 01:07:27 +00:00
6fae60a34a [SymmetricMemory] introduce multimem_all_gather (#142810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142810
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283
2024-12-17 01:07:27 +00:00
519d858c31 Revert "Kill capture_pre_autograd_graph API (#143224)"
This reverts commit 4c62275325afe21052f3fd49ed4135e3db3c47eb.

Reverted https://github.com/pytorch/pytorch/pull/143224 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failure is legit ([comment](https://github.com/pytorch/pytorch/pull/143224#issuecomment-2547264675))
2024-12-17 00:47:24 +00:00
9d57a39541 [C10D] Update docs for wait() (#143305)
Clarify that currently active stream, not default stream, is the one
that will be blocked by a call to wait(), and also point out that the
CPU is not blocked by the call for CUDA/nccl collectives.
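A small sketch of the clarified behavior; it assumes a NCCL process group launched via torchrun.

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # assumes launch via torchrun on CUDA devices
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    work.wait()   # blocks the *currently active* stream (s), not the default stream
    out = t * 2   # safe: stream s is now ordered after the collective
# The CPU itself is not blocked by wait() for CUDA/NCCL collectives.
```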
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143305
Approved by: https://github.com/LucasLLC, https://github.com/ngimel
2024-12-17 00:41:11 +00:00
b3fb8f8a3a FileTimerClient: add retry logic on connect (#143318)
Fixes #143188

The fifo server binds from a thread -- in rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. It will wait up to 1s for the server to start, which balances fast error messages while still providing some wiggle room on the server side.

Test plan:

```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
2024-12-17 00:36:10 +00:00
90fb7c36ab [FSDP2] Clamp reduce_dtype in lazy init (#143297)
fixes https://github.com/pytorch/pytorch/issues/143277 by moving the clamp of `reduce_dtype` to `None` to lazy init (same place as where `param_dtype` can be clamped to `None`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143297
Approved by: https://github.com/weifengpy
2024-12-17 00:25:08 +00:00
dd2cd4279e Create build_directory if it does not exist when generating ninja build file (#143328)
Fixes: https://github.com/pytorch/vision/issues/8816
I am observing this failure on Windows, Python 3.13 vision builds:
```
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
error: [Errno 2] No such file or directory: 'C:\\actions-runner\\_work\\vision\\vision\\pytorch\\vision\\build\\temp.win-amd64-cpython-313\\Release\\build.ninja'
ERROR conda.cli.main_run:execute(49): `conda run packaging/windows/internal/vc_env_helper.bat python setup.py bdist_wheel` failed. (See above for error)
```

Adding the code above fixes it, confirmed by running `` python setup.py bdist_wheel`` :
```
building 'torchvision._C' extension
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
Creating build directory C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/26] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc -Dtorchvision_EXPORTS -IC:\actions-runner\_work\vision\vision\pytorch\vision\torchvision\csrc -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\TH -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\THC -IC:\actions-runner\_work\_temp\conda_environment_12361066769\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Include "-IC:\Pr
```
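A minimal sketch of the fix; the real change presumably lives in torch.utils.cpp_extension, and the helper below is illustrative.

```python
import os

def ensure_build_directory(build_directory: str) -> None:
    # Create the directory that will hold build.ninja before trying to write it.
    if not os.path.exists(build_directory):
        print(f"Creating build directory {build_directory}")
        os.makedirs(build_directory, exist_ok=True)
```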

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143328
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-12-17 00:20:43 +00:00
467970d683 [AOTI] Relax input alignment assertion (#143236)
Summary: https://github.com/pytorch/pytorch/pull/142136 added a runtime alignment assertion. But the assumption is probably too strict for more flexible use cases of AOTI, e.g. Python deployment; see a recent error torchchat ran into for more details: https://github.com/pytorch/torchchat/actions/runs/12322072267/job/34394851280. This PR relaxes the runtime check and implements copy_misaligned_inputs in cpp instead.

Differential Revision: [D67287922](https://our.internmc.facebook.com/intern/diff/D67287922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143236
Approved by: https://github.com/malfet, https://github.com/chenyang78
2024-12-17 00:17:39 +00:00
c4ab3e6ceb remove allow-untyped-defs for torch/__config__.py (#143320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143320
Approved by: https://github.com/aorenste
ghstack dependencies: #143319
2024-12-17 00:16:09 +00:00
0178e43949 remove allow-untyped-defs for torch/utils/_stats.py (#143319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143319
Approved by: https://github.com/aorenste
2024-12-17 00:16:09 +00:00
ff373171d0 [Profiler] Add Optional Flag to turn off external correlations v2 (#143314)
Summary: The original diff got reverted because its base commit was on a broken version of pytorch that was failing rocm tests. There is no indication that this diff had any effect on rocm. Had trouble rebasing the GH pr after revert and accidentally closed the PR so submitting again .

Test Plan: See original PR with same name

Differential Revision: D67293040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143314
Approved by: https://github.com/leitian, https://github.com/aaronenyeshi
2024-12-16 23:49:13 +00:00
10df370a77 Add missing IValue overloads for SymInt lists (#143167)
We should be able to convert Int lists into SymInt lists.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143167
Approved by: https://github.com/ezyang
ghstack dependencies: #143166
2024-12-16 23:18:55 +00:00
557da8014d [gen_autograd_functions] rename some variables (#143166)
This is a follow-up from https://github.com/pytorch/pytorch/pull/141278.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143166
Approved by: https://github.com/soulitzer
2024-12-16 23:18:55 +00:00
4c62275325 Kill capture_pre_autograd_graph API (#143224)
Summary:
Delete the following API:

- capture_pre_autograd_graph()
- capture_pre_autograd_graph_using_training_ir()
- gm_using_training_ir()

There's no more call sites to `capture_pre_autograd_graph`.

Except
1) two test cases in coreml, PR to remove: https://github.com/apple/coremltools/pull/2400
2) XLA: one test case in pytorch/xla, PR to remove: https://github.com/pytorch/xla/pull/8398
3) a few call sites guarded by version guard (< 2.5.0)

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D64056353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143224
Approved by: https://github.com/tugsbayasgalan
2024-12-16 23:06:22 +00:00
6356690b3d Revert "[BE] Revert "Add conda to Manylinux Docker images (#139903)" (#143300)"
This reverts commit c86383f956ee86f34d0ffb94bc229c51c6f11dd9.

Reverted https://github.com/pytorch/pytorch/pull/143300 on behalf of https://github.com/atalman due to failing nova workflows with conda: command not found ([comment](https://github.com/pytorch/pytorch/pull/143300#issuecomment-2547030664))
2024-12-16 22:50:08 +00:00
135a2d4483 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2024-12-16 21:46:08 +00:00
15aee8e090 update aten bmm CK heuristic (#143294)
Summary: updates heuristic to use new instances based on ck profiling of LLM shapes

Differential Revision: D67280269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143294
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-16 21:44:59 +00:00
c86383f956 [BE] Revert "Add conda to Manylinux Docker images (#139903)" (#143300)
This reverts commit 56a40d4ebb0bcf733f1ea5f6efde805326a7a565.

Having conda in manylinux builder images is not required. This was added to have manylinux-builder images as the only images for CD builds after conda-builder was deprecated. However, we decided to start using ``almalinux-builder``.

We are using almalinux-builder for linux_job_v2 which contains conda: https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job_v2.yml#L114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143300
Approved by: https://github.com/seemethere
2024-12-16 21:40:08 +00:00
4e594f4d12 Triton bump for 3.2 cherry-picks (mmav3 segfault fix, gfx950 support) (#143302)
* https://github.com/triton-lang/triton/pull/5277
* https://github.com/triton-lang/triton/pull/5084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143302
Approved by: https://github.com/atalman, https://github.com/pruthvistony
2024-12-16 21:22:29 +00:00
401b1498d2 [BE] typing for decorators - distributed/_tensor/ops/utils (#142139)
Test Plan: unit tests

Differential Revision: D62302679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142139
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-12-16 21:19:33 +00:00
159b7ad8aa Improve async workers to handle forking for async compile (#142072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142072
Approved by: https://github.com/masnesral
2024-12-16 21:16:42 +00:00
678f74988d Fix a misspelling [ONNX] (#143301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143301
Approved by: https://github.com/titaiwangms
2024-12-16 20:19:41 +00:00
8ad842cda4 remove allow-untyped-defs for utils/data/datapipes/dataframe/structures.py (#143273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143273
Approved by: https://github.com/aorenste
ghstack dependencies: #143271
2024-12-16 20:07:36 +00:00
54ed13cdce Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit ca973069ed9a08782695d9407605e219008821e2.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it. breaks an internal test ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2546615951))
2024-12-16 20:05:14 +00:00
e885225eda Add persistent+TMA version of Triton mm and addmm (#142101)
This PR adds persistent+TMA versions (Triton template + the corresponding infra) for the `tuned_mm` and `tuned_addmm` lowerings. The persistent+TMA choices are added to the GEMM autotuning if (checked by the `use_triton_tma_template` helper):

1. The min. hardware and Triton version requirements are met for the TMA support.

2. The GEMM inputs are compatible with the Triton TMA API (i.e., 16-byte aligned and contiguous).

3. The `config.triton.enable_persistent_tma_matmul` is set to `True` (see the usage sketch after the notes below).

Additional notes:

1. As added in this PR, the TMA uses are not compatible with prolog / epilogue fusion. To this end, in the new Triton template we currently support: TMA-based loads of A/B, but no prologue fusion; epilogue fusion, but no TMA-based stores of C. TMA + fusion compatibility can be added as a follow-up.

2. The current Triton TMA API (`experimental_device_tensormap_create2d`) does not support strides. Due to this, we limit the applicability of the new Triton template to the cases where the inputs are contiguous.

3. The transposed layouts of A and / or B are supported by passing the constexpr flags to the kernel and adjusting the ordering of the block sizes accordingly in the kernel code (this should have no effect on the kernel perf, as decided at the Triton compilation time).

4. After the next Triton pin update, we can switch to the tensor descriptor API (landed recently in https://github.com/triton-lang/triton/pull/5290) in the new Triton template, which should allow lifting 2 and 3 above.

5. The configs for the new Triton template in `persistent_mm_kernel_configs` are preliminary. We should do more perf exploration and possibly augment the config in a follow-up.

6. This PR is rebased onto and unifies with two related PRs landed previously: https://github.com/pytorch/pytorch/pull/142045 (some infra unification with the persistent+TMA template for _scaled_mm) and https://github.com/pytorch/pytorch/pull/134532 (add possibility to disable prolog fusion for selected choices).

7. The current Triton TMA API only supports 1D and 2D descriptors (even after https://github.com/triton-lang/triton/pull/5290, see [here](9829ce87cc/python/triton/language/core.py (L1957))). For now, this blocks adding persistent+TMA template for `torch.bmm`.
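To illustrate condition 3 above, here is a hedged usage sketch; it assumes the flag is the Inductor config option named in this PR (`torch._inductor.config.triton.enable_persistent_tma_matmul`) and that the hardware/Triton requirements from conditions 1-2 are met.

```python
import torch
import torch._inductor.config as inductor_config

# Opt in to the persistent+TMA GEMM template (condition 3 above).
inductor_config.triton.enable_persistent_tma_matmul = True

@torch.compile(mode="max-autotune")
def addmm(bias, a, b):
    return torch.addmm(bias, a, b)

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
bias = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
out = addmm(bias, a, b)  # the persistent+TMA template now competes in autotuning
```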

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142101
Approved by: https://github.com/drisspg, https://github.com/eellison
2024-12-16 19:12:12 +00:00
17b71e5d6a Add config alias (#142088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142088
Approved by: https://github.com/c00w
2024-12-16 18:51:17 +00:00
1b6b86fad7 [dynamo] disable eval frame callback around most of _TorchDynamoContext wrapper function (#143211)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1559636954674510/

If the `_fn` returned by `_TorchDynamoContext.__call__` makes an external function call, dynamo is recursively invoked. This can cause issues if there are added calls that are not skipped by Dynamo. So we should disable the eval frame callback as much as possible.

Differential Revision: [D67211749](https://our.internmc.facebook.com/intern/diff/D67211749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143211
Approved by: https://github.com/jansel
2024-12-16 18:38:58 +00:00
1bf983077f [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-16 18:38:32 +00:00
338835d0d2 Add support for other backends in get_preferred_device (#132118)
Currently get_preferred_device supports only cuda and cpu. Add support for other backends using the backend config.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118
Approved by: https://github.com/kwen2501
2024-12-16 18:30:41 +00:00
ccf35af142 [Inductor] Fix the Index Put lowering with same input of self and values (#139366)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/138908, the root-cause is in https://github.com/pytorch/pytorch/issues/138908#issuecomment-2449192447

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_put
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139366
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-12-16 17:07:14 +00:00
7ab3177776 Revert "[AMD] Turn on TF32 for aten::mm (#139869)"
This reverts commit e0bdae7884aed09d9e3f1a3f7a53c095e74a9aff.

Reverted https://github.com/pytorch/pytorch/pull/139869 on behalf of https://github.com/jeffdaily due to causing ROCm CI failures, need to investigate, revert for now ([comment](https://github.com/pytorch/pytorch/pull/139869#issuecomment-2546127069))
2024-12-16 16:46:48 +00:00
a8cc19bb51 [CD] Fix XPU linux CD whl test failure (#143268)
Follow https://github.com/pytorch/pytorch/pull/142482, refer the original fix PR https://github.com/pytorch/pytorch/pull/130742 and new issue in https://github.com/pytorch/pytorch/actions/runs/12323126436/job/34403681230
Works for https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143268
Approved by: https://github.com/atalman
2024-12-16 15:00:03 +00:00
e4d2e81086 Update slow tests (#143278)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143278
Approved by: https://github.com/pytorchbot
2024-12-16 12:40:40 +00:00
d745b2b516 remove allow-untyped-defs for distributed/rpc/_testing/__init__.py (#143271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143271
Approved by: https://github.com/aorenste
2024-12-16 02:35:37 +00:00
9706ada369 [RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)
# Motivation
This PR intends to add C++ accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is f84e533a2c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #143171, #133572
2024-12-16 02:18:41 +00:00
45ac4ebf15 [RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)
# Motivation
This PR intends to add UTs for accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is 952514f0c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #143171
2024-12-16 02:18:41 +00:00
c1d4d9d3cf [MPS] Support torch.accelerator.synchronize() on mps (#143171)
# Motivation
Support `torch.accelerator.synchronize()` on mps. The root cause is that MPS doesn't support lazy initialization. So we must check if the current accelerator supports device lazy initialization rather than returning early.

# Additional Context
Add a mps UT to test code change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171
Approved by: https://github.com/albanD
2024-12-16 02:18:32 +00:00
cyy
af8789c056 Hide torch_python symbols (#142214)
Change symbols in torch_python to invisible by default on platforms other than Apple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214
Approved by: https://github.com/ezyang
2024-12-16 00:59:26 +00:00
744a303dee [FlexAttention] Optimizing learned bias perf to dq calc (#142281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142281
Approved by: https://github.com/Chillee
2024-12-15 21:44:32 +00:00
e0bdae7884 [AMD] Turn on TF32 for aten::mm (#139869)
Summary: hipblaslt supports TF32, so adding the support.

Test Plan: CI

Differential Revision: D65435392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139869
Approved by: https://github.com/leitian
2024-12-15 10:02:29 +00:00
5273d8fd2a [audio hash update] update the pinned audio hash (#143265)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143265
Approved by: https://github.com/pytorchbot
2024-12-15 03:41:14 +00:00
9ed045eae9 Revert "[Profiler] Add Optional Flag to turn off external correlations (#142516)"
This reverts commit b29fc52f827cc4b4336ecd24cc0a019ec9cf24b6.

Reverted https://github.com/pytorch/pytorch/pull/142516 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/142516#issuecomment-2543431758))
2024-12-15 03:34:37 +00:00
dd2d360b7d [ca] re-enable disabled tests (#143247)
FIXES https://github.com/pytorch/pytorch/issues/133197

The unspecified floats PR landed while this test was disabled, and it added an analysis restart which counts towards the backend call counter the test is using

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143247
Approved by: https://github.com/zou3519
2024-12-15 02:11:39 +00:00
cyy
4273e1a059 [5/N] Apply bugprone-unchecked-optional-access (#143111)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143111
Approved by: https://github.com/Skylion007
2024-12-15 01:07:28 +00:00
91bf2e16de [distributed] Remove unused variable in test_composability/test_pp_composability.py (#143191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143191
Approved by: https://github.com/mori360
2024-12-14 12:23:44 +00:00
de484134e4 support slicing with symints in non-strict (#143217)
Differential Revision: [D67215745](https://our.internmc.facebook.com/intern/diff/D67215745/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143217
Approved by: https://github.com/tugsbayasgalan
2024-12-14 10:27:45 +00:00
9933e59c2b [torch][cuda] fix race condition in cuda initialization (#143238)
The access to lazy init callbacks (`_lazy_seed_tracker` and `_queued_calls`) is not synchronized with the initialization lock.

This exposes us to the following race:
1. start `_lazy_init`
2. take `_initialization_lock`
3. flush `_queued_calls` and run them all
4. another thread comes in and uses `_lazy_call` to put something on the queue (in our case, the `manual_seed`)
5. original thread finishes initializing, but never runs that call
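A conceptual sketch of the fix direction, i.e. guarding the queued-call list with the same initialization lock; the names mirror torch.cuda's private helpers, but the code below is illustrative rather than the actual patch.

```python
import threading

_initialization_lock = threading.Lock()
_queued_calls = []
_initialized = False

def _lazy_call(fn):
    with _initialization_lock:        # same lock as _lazy_init, closing the race window
        if _initialized:
            fn()                      # late calls run immediately instead of being lost
        else:
            _queued_calls.append(fn)

def _lazy_init():
    global _initialized
    with _initialization_lock:
        if _initialized:
            return
        for fn in _queued_calls:      # flush everything queued before initialization
            fn()
        _queued_calls.clear()
        _initialized = True
```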

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143238
Approved by: https://github.com/ngimel
2024-12-14 07:41:24 +00:00
28d8297712 Migrate compiler config to Config (#143152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152
Approved by: https://github.com/ezyang
ghstack dependencies: #143229
2024-12-14 07:38:25 +00:00
7c4d29485e Add typechecking indirection for Config (#143229)
When we create a Config[T], we actually dynamically unbox this in the module, so lets have type checker believe that Config[T] creates a T. This enables proper typechecking support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143229
Approved by: https://github.com/aorenste
2024-12-14 07:38:25 +00:00
be5b342332 [Inductor] Move peak memory pass and overlap pass to be run at the right place (#142822)
This PR moves `decide_global_ordering_of_comms` to run first before all other Inductor scheduler passes, so that downstream passes have the correct dependency tracking info. It also moves peak memory pass and overlap pass to the end of all passes, because they need to be the final decision maker on the node order to achieve the desired peak memory and overlap.

This PR fixes hard-to-debug peak memory pass errors caused by incorrect tracking in `.unmet_dependencies` during the enablement of SimpleFSDP on internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142822
Approved by: https://github.com/eellison
2024-12-14 06:53:02 +00:00
3cc617b6a7 __cuda_array_interface__: Use "<V2" for bfloat16. (#143042)
Rationale: While Numpy doesn't support `bfloat16` and therefore there's no official typestr for `bfloat16` in `__array_interface__` (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#__array_interface__), JAX/ml_dtypes uses "<V2":

```
>>> from jax import numpy as jnp
>>> jnp.bfloat16.dtype.str
'<V2'
```

Using the same in PyTorch has the upside of making the typestrs returned by `__cuda_array_interface__` identify the torch dtype uniquely.

### Misc notes

(1) JAX itself just refuses to do `__cuda_array_interface__` for `bfloat16`:

```
>>> from jax import numpy as jnp
>>> jnp.arange(10, dtype=jnp.bfloat16).__cuda_array_interface__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
jaxlib.xla_extension.XlaRuntimeError: INVALID_ARGUMENT: __cuda_array_interface__ is not supported for bfloat16 buffers.
```

(2) The "official" description of `__cuda_array_interface__` doesn't mention bfloat16, it just references `__array_interface__`: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html

(3) Ongoing issue for numpy to support bfloat16: https://github.com/numpy/numpy/issues/19808

(4) Tweet that triggered this: https://x.com/HeinrichKuttler/status/1866761979349844211, with @ezyang responding.

(5) "<V2" is kinda weird, as it's a "little-endian void" type. When given to Numpy, it gets turned into endian-agnostic:

```
>>> import numpy as np
>>> import ml_dtypes
>>> np.dtype("bfloat16").str
'<V2'
>>> np.dtype("<V2").str
'|V2'
```

Still, it makes sense to have a unique string for `bfloat16` and since Google chose "<V2" we might as well use that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143042
Approved by: https://github.com/ezyang
2024-12-14 06:27:52 +00:00
c0a39ad35a [ROCm] Fix TunableOp UTs: Rotating Buffer (#143172)
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value.

Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
96c3b2c388 Expose remaining sharedMem cudaDeviceProps to python (#143226)
Was a bit too fast with my earlier PR: `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`.

I also expose `sharedMemPerBlock` for completeness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226
Approved by: https://github.com/ezyang
2024-12-14 06:13:28 +00:00
cyy
4764303cc6 Use static initialization to avoid once_flag in getCUDAHooks (#143198)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143198
Approved by: https://github.com/albanD
2024-12-14 06:05:41 +00:00
23379e8933 Add torch._compile to uninteresting files (#143209)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143209
Approved by: https://github.com/albanD
2024-12-14 05:40:21 +00:00
ca973069ed Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2024-12-14 03:53:28 +00:00
24f24eebde Get rid of _lazy_import hack (#143213)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143213
Approved by: https://github.com/aorenste, https://github.com/albanD
2024-12-14 03:46:21 +00:00
698eefaddd [audio hash update] update the pinned audio hash (#143245)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143245
Approved by: https://github.com/pytorchbot
2024-12-14 03:37:56 +00:00
cyy
e9f6045e80 [15/N] Fix extra warnings brought by clang-tidy-17 (#143100)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143100
Approved by: https://github.com/Skylion007
2024-12-14 03:24:10 +00:00
33dee721ae Reraise worker errors as runtime errors in more cases when the original exception can't be constructed (#140911)
related to https://github.com/pytorch/pytorch/issues/34130

when pytorch attempts to re-raise an exception from a worker process (e.g. multiprocessing dataloader), if it can't reconstruct the original exception message due to a type error, it instead raises it as a runtime error. However, if it can't reconstruct the exception for some other reason, it throws an error with a stacktrace pointing to the `ExceptionWrapper` code rather than the original underlying issue.

One case in which I ran into this is with boto3's [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))s. They must be constructed with a keyword argument `error`, but if `error` isn't passed, a `KeyError` is thrown instead of a `TypeError`, due to the particular way it is implemented:

* [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))'s constructor accepts variable keyword arguments that it passes to `super` (BotoCoreError)
* [it also defines a field `fmt` with `error`](66dc1f8d52/botocore/exceptions.py (L95))
* BotoCoreError [expects to be able to format that string with the kwargs](66dc1f8d52/botocore/exceptions.py (L41))

So in this case, if a HTTPClientError occurs on a worker process, you simply get a `KeyError: error` with a stacktrace pointing to [this line](3e2f276a14/torch/_utils.py (L710)) which is unhelpful.

Instead, I propose to reraise the error as a `RuntimeError` unconditionally.
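An illustrative sketch of the proposed behavior; the real logic lives in torch._utils.ExceptionWrapper.reraise, and this simplified class only shows the fallback path.

```python
class ExceptionWrapper:
    def __init__(self, exc: BaseException, where: str):
        self.exc_type = type(exc)
        self.exc_msg = f"Caught {type(exc).__name__} {where}:\n{exc}"

    def reraise(self):
        try:
            exception = self.exc_type(self.exc_msg)
        except Exception:
            # If the original type can't be rebuilt from a single message argument
            # (raising TypeError, KeyError from a custom __init__, ...), fall back to
            # RuntimeError unconditionally so the original traceback text still surfaces.
            raise RuntimeError(self.exc_msg) from None
        raise exception
```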
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140911
Approved by: https://github.com/vmoens
2024-12-14 03:11:36 +00:00
cdc03f99b7 [ca] add graph id (#141906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141906
Approved by: https://github.com/jansel
ghstack dependencies: #141919
2024-12-14 03:02:06 +00:00
19f3570000 [EZ] Remove --pre from numpy installation command (#143237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143237
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2024-12-14 02:55:21 +00:00
bf8d4f5b7a [Inductor UT] Generalize device-bias code in test_triton_syntax.py. (#143178)
Fix #143177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143178
Approved by: https://github.com/eellison
2024-12-14 02:08:32 +00:00
86c3370bc3 operator benchmark: write output to a JSON (#142809)
This pull request adds the ability to write the output of the operator benchmark to an optionally specified JSON file. The output is still printed in the terminal as before, but the user now has the option of saving it to a JSON file as well.

Main part of the functionality is implemented using the function _perf_result_to_dict which outputs a dictionary to be put inside a JSON file. Each dictionary corresponds to a single test.
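A hedged sketch of the idea (field names and the flag are illustrative, not the benchmark's actual schema):
```
import json

def perf_result_to_dict(test_name, runtime_us, device):
    # one dict per test case
    return {"test_name": test_name, "runtime_us": runtime_us, "device": device}

results = [perf_result_to_dict("add_M64_N64", 12.3, "cpu")]
output_json_path = "benchmark_results.json"   # e.g. passed via an optional CLI flag
if output_json_path:
    with open(output_json_path, "w") as f:
        json.dump(results, f, indent=2)
```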

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809
Approved by: https://github.com/albanD
2024-12-14 01:42:00 +00:00
12098ad242 Add torch.cat tensors type promotion description (#141339)
Fixes #126964

Add a note describing the type promotion behavior of `torch.cat`
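A small illustration of the promotion behavior the new note documents:
```
import torch

ints = torch.ones(2, dtype=torch.int32)
floats = torch.ones(2, dtype=torch.float32)
out = torch.cat([ints, floats])
print(out.dtype)   # torch.float32 -- inputs are promoted to a common dtype
```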

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/2449f11b-48ed-406e-b73e-6d00f8eadb00)

**After**
![image](https://github.com/user-attachments/assets/cba99572-e8b1-4b9c-ba95-a963b54859ba)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141339
Approved by: https://github.com/albanD
2024-12-14 01:36:41 +00:00
13233e062d Fix Apple Clang ICE when building with -march=armv8.6a (#142879)
When investigating #142703, I found that the build with -march=armv8.6 on my M1 mac was hitting a clang ICE. When looking at the blame code, I finally noticed that this constructor was nonsense, apparently in a way that the compiler frontend accepted but the backend choked on.

example ICE error message:
```
fatal error: error in backend: Cannot select: 0x12689c260: bf16 = uint_to_fp 0x1258324a0
  0x1258324a0: i32 = AssertZext 0x125822d90, ValueType:ch:i16
    0x125822d90: i32,ch = CopyFromReg 0x1238dddc0, Register:i32 %22
      0x12689c6c0: i32 = Register %22
In function: _ZN2at6native7DEFAULTL12logit_kernelERNS_18TensorIteratorBaseERKN3c106ScalarE
c++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 16.0.0 (clang-1600.0.26.3)
Target: arm64-apple-darwin24.1.0
Thread model: posix
```

Unbreaks `env CFLAGS=-march=armv8.6-a CXXFLAGS=-march=armv8.6-a python setup.py develop --cmake` on M1 Mac.

Differential Revision: [D67102953](https://our.internmc.facebook.com/intern/diff/D67102953/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142879
Approved by: https://github.com/malfet
2024-12-14 01:07:01 +00:00
063194aa32 add additional CK BMM Instances (2) (#142874)
Summary: stacked changes to keep new codegen-ed instances below 2000 LOC

Reviewed By: zjing14

Differential Revision: D66985408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142874
Approved by: https://github.com/mxz297
2024-12-14 01:04:34 +00:00
00b0210139 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`. The issue happens for inputs like `-10000.1`, which make `x + sqrt(1 + x**2)` close to 0, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
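A quick standalone repro of the numeric issue (printed values are approximate):
```
import torch

x = torch.tensor([-10000.1], dtype=torch.float32)
# naive formula: for large negative x, x + sqrt(1 + x**2) cancels to ~0 in fp32,
# so the log blows up
naive = torch.log(x + torch.sqrt(1 + x * x))
print(naive)           # -inf (or a wildly inaccurate value)
print(torch.asinh(x))  # ~ -9.90, the correct result
```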

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-14 00:27:55 +00:00
d53164880f dont attempt to fuse in unaligned accesses to mm (#142435)
This isn't profitable - we were trying to fuse in a padding of unaligned mm, which defeats padding's purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142435
Approved by: https://github.com/jansel
ghstack dependencies: #142401, #142402
2024-12-14 00:22:31 +00:00
70be7900bb Fix Tensor clear to properly clear slots (#143203)
Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/137267

While the test ensures the finalizer did run to make sure things are cleared, the objects are not properly collected by the gc due to the faulty tp_clear implementation. So, while the finalizer did run, the object was still alive.
Fixing this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a unique function that handles the full subclass hierarchy in one place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143203
Approved by: https://github.com/ezyang, https://github.com/colesbury
ghstack dependencies: #143202
2024-12-14 00:17:07 +00:00
8741d72e3c move function before modifying it (#143202)
This is a no-op. Just to make the diff in the next PR easier to read

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143202
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2024-12-14 00:17:07 +00:00
3bfdf6f063 Exclude py 3.13t triton package from PyTorch 3.13t wheel (#143218)
Follow up after https://github.com/pytorch/pytorch/pull/143162
Include triton only for 3.13 packages, not 3.13t
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143218
Approved by: https://github.com/kit1980
2024-12-14 00:12:45 +00:00
515abb7744 [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking script with appropriate cpython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 23:45:47 +00:00
8621b9ff0c Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)
For prologues which only do loads (such as gathers) or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 and without changing numerics.

Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box.

We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #142401
2024-12-13 23:25:15 +00:00
4e0de50eb5 Revert "[CI] Add Triton 3.13t build (#143212)"
This reverts commit 571cd92d7c4c7bd2d5f068b5a285e0e70b8d0a40.

Reverted https://github.com/pytorch/pytorch/pull/143212 on behalf of https://github.com/janeyx99 due to lint is failing, the other failures don't seem relevant but ci has turned red after this change haha ([comment](https://github.com/pytorch/pytorch/pull/143212#issuecomment-2542521875))
2024-12-13 23:03:45 +00:00
f406207af2 Revert "[ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)"
This reverts commit 1e2b841675e50a6abd8dab9a95b33fda64b12e2b.

Reverted https://github.com/pytorch/pytorch/pull/142827 on behalf of https://github.com/jeffdaily due to prematurely dropped support for gfx900/gfx906 ([comment](https://github.com/pytorch/pytorch/pull/142827#issuecomment-2542507857))
2024-12-13 22:48:44 +00:00
ad2faec8bb Add a pass which analyzes whether a prologue preserves zero mask (#142401)
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tl.where(a_mask, tmp1, 0.0)
```
now we do not need to ->
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tmp1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
2024-12-13 22:37:33 +00:00
b29fc52f82 [Profiler] Add Optional Flag to turn off external correlations (#142516)
Summary: External Correlations are super spammy and oftentimes not even useful. Add flag during init to remove them entirely

Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Dec_10_12_33_31.531106.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D67048206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142516
Approved by: https://github.com/ngimel
2024-12-13 22:32:09 +00:00
bb574abe73 [BC-Breaking]Remove capture_pre_autograd_graph references in quantization (#139505)
Summary:
As title

This is a BC-breaking change because a graph produced by "capture_pre_autograd_graph" cannot be used as input to quantization anymore. But this is OK, since this API has been deprecated for a while and is going to be deleted. We have removed all call sites of it.

We remove the deprecated API references in code, docs, and tests.

We also removed two tests that are specific to the capture_pre_autograd_graph API.

Test Plan: CI

Differential Revision: D65351887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505
Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168
2024-12-13 22:26:22 +00:00
d25e6e623f Fix unused Python variables in test/[a-d]* (#134665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665
Approved by: https://github.com/albanD
2024-12-13 22:13:12 +00:00
e19f493f02 add private config to temporarily preserve old FSDP guard behavior (#142871)
Summary: https://github.com/pytorch/pytorch/pull/138819 wobbled dynamo guards in a way that caused some performance regression, so this PR temporarily adds a config to get the old behavior back while we investigate.

Test Plan: CI

Differential Revision: D67096751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142871
Approved by: https://github.com/yf225
2024-12-13 22:06:48 +00:00
8fae4397b4 Add "inductor_pre_grad_graph" logging (#142717) (#143126)
Summary:

Add new structured logging "inductor_pre_grad_graph"

This is for inductor provenance tracking front-end to load this graph from tlparse.
ghstack-source-id: 257581974
exported-using-ghexport

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' //caffe2/test/dynamo:test_dynamo -- -r StructuredTraceTest
```

Differential Revision: D67150288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143126
Approved by: https://github.com/desertfire
2024-12-13 21:48:25 +00:00
8a04018329 [MPS] Fix conv backward for channels last (cont) (#143196)
This is a continuation of https://github.com/pytorch/pytorch/issues/140902 but extends the same logic to input.

Looks like the existing channels-last logic just produced incorrect results on pre-MacOS-15 versions and fails on MacOS-15, so removing it feels like the right idea

Fixes https://github.com/pytorch/pytorch/issues/142344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143196
Approved by: https://github.com/manuelcandales
2024-12-13 21:32:42 +00:00
571cd92d7c [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking script with appropriate cpython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 21:28:52 +00:00
60c54467db [logging] Log runtime autotuning timing to scuba (#141919)
See test plan in internal diff [D66679369](https://our.internmc.facebook.com/intern/diff/D66679369)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141919
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-13 21:22:13 +00:00
0d6d29af38 [CUDA] Follow up to clean up some set_per_process_memory_fraction usage in tests (#142811)
follow-up to #140852 now that #140620 has landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142811
Approved by: https://github.com/Skylion007
2024-12-13 21:09:05 +00:00
65d0a25289 [associative_scan] patch inductor tests to always run with static shape (#143161)
fixes #143053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143161
Approved by: https://github.com/eellison
2024-12-13 21:06:12 +00:00
52f31cc238 dynamo tracing perf: Guard slots: 51.76 -> 51.34 (#143060)
See #143056 for overall docs.

This PR: Add slots to Guard
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143060
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058, #143059
2024-12-13 21:02:50 +00:00
e87f07d3b8 Revert "Migrate compiler config to Config (#143152)"
This reverts commit 1ebdfd56053dafa8880a0dedf535fff70aa92e09.

Reverted https://github.com/pytorch/pytorch/pull/143152 on behalf of https://github.com/oulgen due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/143152#issuecomment-2542342073))
2024-12-13 20:55:14 +00:00
625b4edb97 [CD] Test torch.compile on 3.13 (#143207)
Follow up after https://github.com/pytorch/pytorch/pull/143162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143207
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-12-13 20:01:36 +00:00
fe9365f3f5 Add check_binary workflow to pytorch/pytorch (#143201)
Migrated from pytorch/builder
Related to: https://github.com/pytorch/builder/issues/2054

Copying from : 3468139e81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143201
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-13 19:30:10 +00:00
8f40446770 Fix precedence of bitwise and/or printing (#143197)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143197
Approved by: https://github.com/albanD, https://github.com/williamwen42
2024-12-13 19:29:42 +00:00
1ebdfd5605 Migrate compiler config to Config (#143152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152
Approved by: https://github.com/ezyang
ghstack dependencies: #143150, #143151
2024-12-13 19:29:07 +00:00
f1ff8bc1c5 Add type to Config (#143151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143151
Approved by: https://github.com/ezyang
ghstack dependencies: #143150
2024-12-13 19:29:07 +00:00
9d05c8110d Require Config to have a default (#143150)
With aliases coming soon, we want to reject alias + default combo, so we need defaults to be passed in. On top of this, this simplifies statically type checking config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143150
Approved by: https://github.com/ezyang
2024-12-13 19:28:59 +00:00
bf711a9cce [ROCm] Improve performance of reduce sum for 3D shapes (#143137)
Improve performance of reduce sum for 3D shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143137
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-13 19:02:00 +00:00
6178be822d dynamo tracing perf: direct Guard: 52.58 -> 51.76 (#143059)
See #143056 for overall docs.

This PR: Remove explicit constant check from `VariableBuilder.install_guards()`
the args calling convention.  Also remove a lambda binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143059
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058
2024-12-13 18:20:48 +00:00
6bcda3a21a dynamo tracing perf: cache on import_source: 52.9 -> 52.58 (#143058)
See #143056 for overall docs.

This PR: add cache to `InstructionTranslatorBase.import_source()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143058
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056
2024-12-13 18:20:48 +00:00
b472d82c96 dynamo tracing perf: import in build: 60.48 -> 59.92 (#143056)
A series of directed perf improvements to drive down the dynamo tracing cost of
the given test. Before this PR stack the compile took about 60s, and after takes
30s. Individual improvements are listed below along with the approximate
improvement of that change.

Tested with this model:
```
@torch.compile(backend="eager")
def model_add(x, y):
    out = x
    for i in range(5000):
        out = torch.add(out, y)
    return out
```

This PR: Stop importing builder in the inner loop of `VariableTracker.build()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143056
Approved by: https://github.com/jansel
ghstack dependencies: #143066
2024-12-13 18:20:48 +00:00
63e1f97f4b dynamo tracing perf: don't unnecessarily call getframeinfo on the hot path: 47.26 -> 37.66 (#143066)
See #143056 for overall docs.

This PR: Stop using `getframeinfo()` when we only care about the function name
and throw the rest away.
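A generic illustration of the trick (not the dynamo code itself): when only the function name is needed, read it off the code object instead of paying for `getframeinfo()`.
```
import inspect
import sys

def calling_function_slow():
    # getframeinfo() also resolves filename/lineno/source context -- wasted work
    # when only the name is needed
    return inspect.getframeinfo(sys._getframe(1)).function

def calling_function_fast():
    # read the name straight off the caller frame's code object
    return sys._getframe(1).f_code.co_name

def some_caller():
    assert calling_function_slow() == calling_function_fast() == "some_caller"

some_caller()
```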

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143066
Approved by: https://github.com/jansel
2024-12-13 18:20:48 +00:00
e0c8abda76 Fix potentially undefined behaviour in index_put sample input (#143116)
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:

> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.

Currently the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, they could be the same, leading to undefined behaviour if `accumulate=False`.

This PR changes the input generation to only generate a single index if `accumulate=False`, preventing duplicate indices and undefined behaviour.
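A small standalone illustration of the documented behavior (values chosen for illustration):
```
import torch

vals = torch.tensor([1.0, 2.0])
dup_idx = (torch.tensor([1, 1]),)   # duplicate indices

acc = torch.zeros(4)
acc.index_put_(dup_idx, vals, accumulate=True)      # well defined: acc[1] == 3.0

no_acc = torch.zeros(4)
no_acc.index_put_(dup_idx, vals, accumulate=False)  # undefined: no_acc[1] may be 1.0 or 2.0
```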

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
2024-12-13 17:59:01 +00:00
23b8ea3094 Allow disabling int specialization on nn.Modules (#142829)
Resolves issue #140464 by adding an option to not specialize int from nn.Modules (False by default to maintain existing behavior).

Test Plan: `buck2 test mode/opt caffe2/test/dynamo:test_dynamo -- test_modules.py::NNModuleTests::test_nn_module_unspec_int_attr`

Differential Revision: D66837042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142829
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-12-13 17:26:11 +00:00
82a45d19b4 Expose sharedMemPerMultiprocessor device property to python (#143119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119
Approved by: https://github.com/ezyang
2024-12-13 16:53:57 +00:00
3f62054de1 [ROCm] upgrade nightly wheels to rocm6.3 - 1 of 2 (docker images) (#142151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142151
Approved by: https://github.com/jeffdaily
2024-12-13 16:21:17 +00:00
7968732f5b Fix int8 mm V.ops.mul dispatching (#143127)
This is sort of subtle - because we were doing `V.ops.mul` at binding time, we dont redispatch later when we invoke the epilogue. and then later running into assertion checking in pr above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127
Approved by: https://github.com/drisspg
ghstack dependencies: #143048
2024-12-13 16:17:23 +00:00
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
fbfc530442 [export][ez] Fix forward D67044185 (#143193)
Summary: Fixing forward D67044185 and T210459833 by adding the missing build file.

Test Plan: buck2 build --flagfile fbcode//mode/opt fbcode//admarket/training_data/augmentation/processors/tests:model_manager_test

Differential Revision: D67200056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143193
Approved by: https://github.com/tugsbayasgalan
2024-12-13 16:06:42 +00:00
04bb82f097 Linux Wheels: Remove triton dependency python < 3.13 constraint (#143162)
We do build the pytorch-triton package for Python 3.13: https://github.com/pytorch/pytorch/actions/runs/12304476674/job/34344764271
Hence the constraint is no longer needed.
This stack enabled torch.compile for Python 3.13: https://github.com/pytorch/pytorch/pull/141264
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143162
Approved by: https://github.com/kit1980
2024-12-13 15:08:44 +00:00
810808d97d Enable cutlass-based all-gather matmul when TORCH_SYMM_MEM_ENABLE_NATIVE_ASYNC_TP is set (#142283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142283
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-12-13 10:29:14 +00:00
3e1f587514 [AOTI] Fix an autotune block grid computation issue (#143098)
Summary: There is a grid computation issue after switching to one-pass codegen in https://github.com/pytorch/pytorch/pull/141980. When max-autotune is turned on, there is an incorrect grid codegen in some cases.

Reviewed By: henrylhtsang

Differential Revision: D67120987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143098
Approved by: https://github.com/henrylhtsang
2024-12-13 07:52:30 +00:00
9f90583ca2 [CI] Run aarch64 tests on Graviton3 (#143129)
Graviton3 is armv8.6, which has SVE and BF16 capability

mkldnn_pattern_matcher skips are tracked in https://github.com/pytorch/pytorch/issues/143146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143129
Approved by: https://github.com/digantdesai
2024-12-13 07:39:22 +00:00
c37185c76a [BE] Stop using deprecated APIs in mkldnn_pattern_matcher (#143156)
This should fix
```
/var/lib/jenkins/workspace/test/inductor/test_mkldnn_pattern_matcher.py:157: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```
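For reference, a minimal example of the replacement spelling (tensors here are just placeholders):
```
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)

# deprecated spelling (emits the FutureWarning above):
#   with torch.cpu.amp.autocast(dtype=torch.bfloat16): ...
# current spelling:
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    c = a @ b
print(c.dtype)   # torch.bfloat16
```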

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143156
Approved by: https://github.com/kit1980
2024-12-13 06:37:20 +00:00
cyy
075905b7bd [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-12-13 06:22:13 +00:00
72fd7abb35 [ca] fix flex attention backward HOP capture in initial graph (#143155)
FIXES https://github.com/pytorch/pytorch/issues/142313

So with previous HOPs, compiled autograd could just inline into their body and get their post-dispatch aten representation. You can't do that with this flex attention HOP, which just wants any proxy tracing mechanism to insert it into its graph. Okay, compiled autograd does use proxy tracing, so we can do that.

This is safe because other than the reenter_make_fx call, there were no other make_fx internals usage in the HOP. And compiled autograd specializes on the AOT backward's saved symints which should cover any changes in shapes to the inputs of the HOP.

However, there's still an issue: Dynamo doesn't know how to handle `FlexAttentionBackwardHOP` and will graph break, so the flex attention backward is running in eager as of this PR. The tlparse looks really scuffed after the compiled autograd capture: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpMMHBEH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143155
Approved by: https://github.com/drisspg
2024-12-13 06:04:39 +00:00
b4f4c75e19 [dynamo] Support multiple inheritance for custom dict construction (#142416)
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.

Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.

Fixes #141118.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
2024-12-13 05:13:05 +00:00
b5d8d2444a add README.md for compile time benchmarks (#143145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143145
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517, #143143
2024-12-13 05:12:26 +00:00
b7ad52abb0 Use new group instead of split group on non-CUDA device (#141469)
Motivation:

Currently, `split_group` only works for the NCCL backend. https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745. So we need to use `new_group` on other, non-CUDA devices.
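A minimal sketch of the portable fallback, assuming `dist.init_process_group(...)` was already called with whatever backend the non-CUDA device supports (e.g. gloo or a custom backend):
```
import torch.distributed as dist

ranks = list(range(dist.get_world_size() // 2))

# split_group is currently NCCL-only, so on other backends fall back to the
# portable API:
subgroup = dist.new_group(ranks=ranks)
```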

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-13 05:11:33 +00:00
57c46af47a [Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt (#142110)
### Summary

Extends #142036 with an Inductor pattern-matching pattern covering the torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled) -

- int8 quantized (symmetrically) activation (per token quantized).
- Statically, per-channel, symmetrically int8 quantized weights (which are constant because freezing is enabled), so their scales are also constant. (They would have been constant even with dynamic quantization anyway, due to the constant weights.)

The pattern that's matched is `torch._int_mm` -> convert to FP32/BF16 -> [optional expand for activation scale] -> `mul` -> `mul`.

We don't check if the activation is dynamically quantized or whether the weights are statically quantized, though (since the implementation won't have any side effects even if that weren't true).

In practice, it also matches the smooth-quant int8 quantized linear pattern if its output is not reshaped (if activation is 2D).
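For reference, a hedged sketch of the eager computation the matched pattern corresponds to (shapes and scales below are made up for illustration, and `_int_mm` has its own device/shape requirements):
```
import torch

a_int8 = torch.randint(-128, 127, (32, 64), dtype=torch.int8)   # per-token quantized activation
w_int8 = torch.randint(-128, 127, (64, 32), dtype=torch.int8)   # per-channel quantized weight
a_scale = torch.rand(32, 1)    # per-token activation scales
w_scale = torch.rand(32)       # per-channel weight scales

acc = torch._int_mm(a_int8, w_int8)                  # int8 x int8 -> int32
out = acc.to(torch.float32) * w_scale * a_scale      # dtype convert -> mul -> mul
```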

### More details

oneDNN int8 matmul supports application of per-channel weight scale but not a vector activation scale, which could be applied as a post op, but is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused.

The fusion pattern used in this PR is `torch._intmm` -> convert to FP32/BF16 ->`mul`, which will be replaced by oneDNN qlinear op.

The speedup over eager-mode is due to 2 reasons -
1. fusion of int8xint8 -> int32 GEMM, conversion to FP32/BF16 & application of weight scale. (In case of BF16, many intermediate conversions are also avoided).
2. weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time.

But, in the future, the whole pattern (including application of activation scale, which would be a mul post-op) + bias could be fused if corresponding support would be enabled in ATen.

### Verification

Added UT in this PR
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```

#### Corresponding torchao UTs

1. int8 Smoothquant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`.
The difference from #139595 is that there are no reshapes of the linear output in this pattern.

2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights -  ` TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
2024-12-13 04:59:03 +00:00
b731ced91f Prologue Fusion (#134532)
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-13 04:18:25 +00:00
ceb664aca6 add float_args benchmark (#143143)
71% improvement with automatic dynamic float arguments

with specialize_float=False
```
float_args,compile_time_instruction_count,346293869
```

with specialize_float=True
```
float_args,compile_time_instruction_count,1198546486
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143143
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517
2024-12-13 03:35:59 +00:00
ab04f3aee1 [ca] set autograd graph task state (#143108)
GraphTask holds metadata needed for a single execution of backward(), it is 1:1 with backward calls, at least for compiled autograd. It is used for certain torch._C global autograd state APIs.

In SAC, we use torch._C._current_graph_task_id() as a dict key to store information during unpack hook execution: a5fb07af27/torch/utils/checkpoint.py (L1128)

If we don't set an active task, it will randomize the key, and will do its logic as if each unpacked tensor was from a different graph task
a5fb07af27/torch/utils/checkpoint.py (L1112-L1115)
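A hedged sketch of why the task id matters: per-backward state keyed on `torch._C._current_graph_task_id()`, analogous to what checkpoint's unpack hooks do in SAC (the dict and hooks below are illustrative, not the checkpoint code):
```
import torch
from torch.autograd.graph import saved_tensors_hooks

state_per_task = {}

def pack(t):
    return t

def unpack(t):
    gid = torch._C._current_graph_task_id()   # -1 when no graph task is active
    state_per_task.setdefault(gid, []).append(tuple(t.shape))
    return t

x = torch.randn(3, requires_grad=True)
with saved_tensors_hooks(pack, unpack):
    (x * x).sum().backward()
print(state_per_task)   # with a proper task id, all unpacks share one key
```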

The sketchy part of this PR is that in eager autograd, GraphTask is mutated during execution. But inspecting the struct, the mutation seems to only be used to communicate between autograd threads (created when multiple devices are involved) or for deprecated uses. We shouldn't run into the mutation case at all in compiled autograd. Also, only the graph task id is accessible from python hooks.

FIXES https://github.com/pytorch/pytorch/issues/142862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143108
Approved by: https://github.com/jansel, https://github.com/albanD
2024-12-13 03:10:48 +00:00
dbe4b69df0 [Inductor] Fix cooperative reduction tests broken in recent refactor (#143135)
These tests were broken by https://github.com/pytorch/pytorch/pull/142020. This PR updates the fixed configs accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143135
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-12-13 02:03:43 +00:00
cyy
9f5ebf3fc6 Clang-format aten/src/ATen/native/Tensor*{cpp,h} (#143089)
These files are relatively stable, so it should be safe to format them without incurring conflicts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143089
Approved by: https://github.com/albanD
2024-12-13 00:06:48 +00:00
2533a5a843 upgrade sccache to 0.9.0 (#142854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142854
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-12-12 22:49:50 +00:00
fb93462904 [Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#142036)
Reopen of https://github.com/pytorch/pytorch/pull/139595

**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` only supports a scalar activation scale, which is a limitation of the oneDNN library, so in that case we set the activation scale to 1 and the bias to None, and apply the scales and add the bias after `onednn.qlinear_pointwise`.

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Differential Revision: [D66796966](https://our.internmc.facebook.com/intern/diff/D66796966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142036
Approved by: https://github.com/jerryzh168, https://github.com/jgong5

Co-authored-by: sanchitintel <sanchit.jain@intel.com>
2024-12-12 21:18:03 +00:00
602c86a420 [DSD] Fix strict=False case for DDP (#143038)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143038
Approved by: https://github.com/mori360
2024-12-12 21:15:21 +00:00
a7509e98c5 [pipelining] fix backward_one_chunk when the output of the model is a… (#142237)
fixes #142229

If any element of ``stage_output`` is a view, it cannot be detached in place. Replacing it with ``t = t.detach()`` or similar would not free the graph for the output given to the user. Detaching the base tensor could cause side effects.

The same code is used in ``_backward.py`` (b64a537993/torch/distributed/pipelining/_backward.py (L215)) but does not seem to cause any issue in my case. Maybe needs some investigation.
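A small standalone illustration of the constraint being fixed (tensors here are placeholders; autograd rejects in-place detach of a differentiable view):
```
import torch

base = torch.randn(4, requires_grad=True) * 2   # non-leaf, so slices are autograd views
view = base[:2]

try:
    view.detach_()            # in-place detach of a view: raises a RuntimeError
except RuntimeError as e:
    print("detach_ failed:", e)

safe = view.detach()          # out-of-place detach returns a new tensor, but simply
                              # rebinding a local would not release the graph held by
                              # the original output handed to the user
```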

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142237
Approved by: https://github.com/H-Huang
2024-12-12 20:59:35 +00:00
39cacc1d81 Fix missing tests on test tool lint job (#143052)
A follow-up from https://github.com/pytorch/pytorch/pull/142476#discussion_r1878888558 where some tests are not discovered correctly by pytest

### Testing

https://github.com/pytorch/pytorch/actions/runs/12287448581/job/34289531307?pr=143052#step:14:162 shows the correct number of tests now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143052
Approved by: https://github.com/ZainRizvi
2024-12-12 20:29:32 +00:00
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
0b75b7ff2b [Easy] factor out inductor ophandler decompositions (#142400)
Factor out inductor operator decompositions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-12-12 19:03:26 +00:00
c170248b78 [Profiler] Enable Iterative Step without profiler in fbcode (#142077)
Summary: Adds a post-optimizer hook for fbcode so that we can run iterative tracing on demand without having to use a frontend profiler interface. Since this is being used more frequently, it is convenient for users to be able to trigger this on-demand feature without having to worry about being within some timing window.

Test Plan: Ran iterative tracing without profiler.profile

Differential Revision: D66734119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142077
Approved by: https://github.com/briancoutinho
2024-12-12 19:00:13 +00:00
e3fe5f62b6 Remove Checkout pytorch/builder for Linux Binary Builds (#143125)
Follow Up after: https://github.com/pytorch/pytorch/pull/142282

Remove Checkout pytorch/builder for Linux Binary Builds
I believe we were not using builder already. Hence remove this checkout.
We should be using scripts from this folder:
```
/pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh
```

TODO: Will followup with removing BUILDER_ROOT everywhere from PyTorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143125
Approved by: https://github.com/kit1980
2024-12-12 18:55:00 +00:00
d48b16a725 Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847)"
This reverts commit 357e261b1eded933d98de18ddcef2b083f87259d.

Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/atalman due to Breaks binary builds, see the comment above ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2539759580))
2024-12-12 18:44:35 +00:00
b0c3d39e0d [pipelining] Update tutorials and documentation (#143045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 18:42:17 +00:00
ee5bceaee6 [sigmoid] Write the new export schema format to archive without breaking compatibility. (#142511)
Summary:
This diff makes it possible to migrate to PyTorch's OSS export schema from sigmoid. Basically, we add a new field called "methods" to ExportedProgram in the Model definition, which contains the thrift schema generated based on schema.py from OSS. This way, we can keep writing the old fields while double-writing a new format in equivalent form. Since thrift doesn't support inlining type definitions, we do it manually here, and it shouldn't break on-wire compatibility. As long as every sigmoid user is using sigmoid.frontend.serialization.serialize, we always guarantee to have the new format saved as well.

Eventually we will use JSON deserialization from OSS, so we will only keep this double writing for a couple of months. After that, we will migrate every serialization path to the OSS workflow.

Test Plan:
buck test mode/opt sigmoid/frontend:serialization_test
buck test mode/opt sigmoid/frontend/test_gpu:serializer_test

Differential Revision: D67044185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142511
Approved by: https://github.com/desertfire
2024-12-12 18:41:10 +00:00
5dabe2d464 Fix NJT backward tests (#143072)
This PR fixes some issues with NJT backward / compile backward tests:
1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a LOT of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status. Note: the clone() / detach() stuff is for autograd; can't have two SampleInputs as part of the same autograd graph.
2. Per-sample skips weren't -fully- working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a ton of things but ultimately had to split the `subtest_ctx` into that + a `skip_xfail_ctx` to run the subtests within.
    * Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py`
3. Added the appropriate skips / xfails to get everything passing. There are a shitton of bugs to fix!
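For item 1 above, a hedged sketch of what the `_clone()` helper needs to do (the implementation details here are an assumption; only the name comes from the PR):
```
import torch

def _clone_preserving_grad(t):
    # give each SampleInput its own tensor (so inputs don't share an autograd
    # graph) while keeping requires_grad so backward actually gets exercised
    out = t.clone().detach()
    out.requires_grad_(t.requires_grad)
    return out

x = torch.randn(3, requires_grad=True)
y = _clone_preserving_grad(x)
assert y.requires_grad and y.is_leaf and y.grad_fn is None
```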
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-12-12 18:06:23 +00:00
d47a80246a [dynamo][pytree][3/N] make CXX pytree traceable: tree_map / tree_map_ (#137399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137399
Approved by: https://github.com/jansel
ghstack dependencies: #137398
2024-12-12 18:05:25 +00:00
7edeb1005a [dynamo][pytree][2/N] make CXX pytree traceable: tree_flatten / tree_unflatten / tree_structure (#137398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137398
Approved by: https://github.com/jansel
2024-12-12 18:05:25 +00:00
c85323c5e8 Revert "Tests Generelization for multiple accelerator devices (#139184)"
This reverts commit b576a8c318201b63269f7ff25ec5830d00662a7a.

Reverted https://github.com/pytorch/pytorch/pull/139184 on behalf of https://github.com/clee2000 due to Failing internally when trying to pickle distributed test files D67098795 ([comment](https://github.com/pytorch/pytorch/pull/139184#issuecomment-2539610187))
2024-12-12 17:48:30 +00:00
2f0fe82f6d Revert "[14/N] Fix extra warnings brought by clang-tidy-17 (#141644)"
This reverts commit 24a5a2ef258d2b482ded674cdb9555afaf081402.

Reverted https://github.com/pytorch/pytorch/pull/141644 on behalf of https://github.com/clee2000 due to failing internally D67112938 ([comment](https://github.com/pytorch/pytorch/pull/141644#issuecomment-2539602023))
2024-12-12 17:43:36 +00:00
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
7667235a23 c10::optional -> std::optional (#142514)
Fixes issues introduced in https://github.com/pytorch/pytorch/pull/141348 and https://github.com/pytorch/pytorch/pull/139578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142514
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-12 17:23:46 +00:00
520ba556cd [Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.

The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.

These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.

# Test plan

The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-12 17:22:20 +00:00
cf538efd0c Revert "Hide torch_python symbols (#142214)"
This reverts commit da76e912a4c58c649061fc84b29a42714897a0ca.

Reverted https://github.com/pytorch/pytorch/pull/142214 on behalf of https://github.com/huydhn due to The MacOS failure looks legit as it shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/142214#issuecomment-2539543504))
2024-12-12 17:15:51 +00:00
15ee2960e1 [aot] Functionalize aot backward prologue and epilogue wrappers (#142415)
For functional compiled autograd, we're having dynamo trace through the aot backward implementation. To avoid graph breaking and imposing too many restrictions, we allow_in_graph the prologue and epilogue. This adds 2 restrictions:
- code must be available in the global context
- inputs other than tensors/symnodes must be const foldable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142415
Approved by: https://github.com/bdhirsh
2024-12-12 17:14:29 +00:00
30b61e521c [logging] Populate compile_time_autotune_time_us (#143104)
See testing in attached diff

Differential Revision: [D67128210](https://our.internmc.facebook.com/intern/diff/D67128210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143104
Approved by: https://github.com/ezyang
2024-12-12 17:08:43 +00:00
e3ddc0ca33 Support remote caching requiring redis auth (#141679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141679
Approved by: https://github.com/masnesral
2024-12-12 17:07:50 +00:00
0f78be5573 Fix search icon (#142808)
Removing:

.pytorch-left-menu-search input[type=text] {
    background-image: none;
}
so that the search icon correctly appears in the sphinx searchbox

Also, fixing scrolling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808
Approved by: https://github.com/albanD
2024-12-12 16:09:30 +00:00
725526abc5 Fix scan dtypes (#143048)
FIx for https://github.com/pytorch/pytorch/issues/142883. We weren't getting test coverage of scan because the tests were being skipped. see, https://github.com/pytorch/pytorch/issues/143053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143048
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister
2024-12-12 15:57:00 +00:00
d83a049232 [EZ] Update lintrunner in CI to 0.12.7 (#143073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143073
Approved by: https://github.com/wdvr
2024-12-12 15:35:37 +00:00
7cc3a591c2 [FlexAttention] Fix a few more symbolic shape issues (#142816)
# Summary

See  https://github.com/pytorch/pytorch/issues/139064 for more details. This fixes a number of issues with dynamic shapes. Thanks to @alexdremov for finding most of these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142816
Approved by: https://github.com/yanboliang, https://github.com/ezyang
2024-12-12 15:29:21 +00:00
84f791381a Python 3.13 CI add crossref test to existing linux-focal-py3_13-clang10-build (#143074)
Add  linux-jammy-py3_13-gcc11-build and test - similar to Py 3.9
Add crossref test to existing linux-focal-py3_13-clang10-build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143074
Approved by: https://github.com/malfet
2024-12-12 14:45:56 +00:00
cd1b5924d5 Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)"
This reverts commit 79cf8fa75176a8f6bb79d426c6d0f9369d03ff98.

Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))
2024-12-12 14:42:55 +00:00
30e2b322a1 Add <string> to uninteresting_files (#142984)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142984
Approved by: https://github.com/albanD, https://github.com/IvanKobzarev
2024-12-12 14:35:30 +00:00
91261107e0 debug handler maintain through decomposition (#141612)
Add checks in the ao numberic debugger to guard the debug handle consistency between aten op decomposition

Differential Revision: [D66517480](https://our.internmc.facebook.com/intern/diff/D66517480/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141612
Approved by: https://github.com/jerryzh168
2024-12-12 12:26:45 +00:00
18785c1af9 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-12 10:53:48 +00:00
a5fb07af27 [Torch][#142396]Resolve Failure When Uploading To Remote Storage (#143046)
Summary: Catch the io.UnsupportedOperation exception so that streams without fileno support don't cause a failure
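A hedged sketch of the defensive pattern (not the actual checkpoint-upload code; the helper name is illustrative):
```
import io

def fileno_or_none(stream):
    # some file-like objects (e.g. in-memory or wrapped streams) raise
    # io.UnsupportedOperation from fileno()
    try:
        return stream.fileno()
    except (AttributeError, io.UnsupportedOperation):
        return None

print(fileno_or_none(io.BytesIO(b"checkpoint bytes")))   # None instead of a crash
```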

Test Plan: UT

Differential Revision: D67108487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143046
Approved by: https://github.com/saumishr
2024-12-12 08:17:15 +00:00
497f89ff83 fix dynamo nn module stack fqn (#142823)
Dynamo can produce sources that have funny patterns in their `.name()` that break `nn_module_stack` fqns. Added a test that used to have `._modules` inside nn_module_stack fqns, now doesn't. (Unfortunately couldn't repro a case mentioned in the GH issue where `.slice(...)` is claimed to appear as well.)

Fixes https://github.com/pytorch/pytorch/issues/141939

Differential Revision: [D67064189](https://our.internmc.facebook.com/intern/diff/D67064189/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142823
Approved by: https://github.com/pianpwk, https://github.com/zhxchen17
2024-12-12 07:02:13 +00:00
da76e912a4 Hide torch_python symbols (#142214)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214
Approved by: https://github.com/ezyang
2024-12-12 07:00:54 +00:00
dcb128d495 [ROCm] TunableOp use thread-safe getenv functions (#142274)
Fixes #142403

~~PR fixes breakage due to this commit
8cd7ad8b48~~

PR is a partial reland of this https://github.com/pytorch/pytorch/pull/140594 with a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142274
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-12 06:49:26 +00:00
5ad7d5304c [DTensor][random] add HSDP+TP model init test (#143077)
**Summary**
1. Move the model init tests from `DistTensorRandomOpTest` to `DistTensorRandomInitTest`
2. Added a HSDP+TP meta init test to show correct model init result in this use case. Note that this test requires 8 GPUs to run and our CI doesn't have that capacity so this test will be skipped on CI testing. A local run shows that the test passes on a 8-GPU host.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_hsdp_tp_model_meta_init`

<details>
<summary> Test Result </summary>
<img width="3343" alt="image" src="https://github.com/user-attachments/assets/a960c5e6-37bc-49be-9e36-ecc29ed47eb0" />

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143077
Approved by: https://github.com/weifengpy
2024-12-12 06:46:16 +00:00
357e261b1e [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-12 06:38:22 +00:00
9701c50bdc [Dynamo] Add missing tensor builtins to allowed functions (#142841)
Fixes https://github.com/pytorch/pytorch/issues/141232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142841
Approved by: https://github.com/yanboliang
2024-12-12 06:38:19 +00:00
b25f64b613 Add-o pipefail for all bash scripts (#143050)
Fixes #142380
I have added -o pipefail in all bash scripts in pytorch/.ci/pytorch. Sorry I didn't double-check the submodule in my last PR. Thanks for the correction! Please contact me again if there are any problems with this fix^^. (Actually contributing to the open source community is an assignment for one of my courses and today is the deadline so I rushed to revise it when I saw an email early in the morning. Haha.)
 @ezyang @malfet @huydhn @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143050
Approved by: https://github.com/ezyang, https://github.com/huydhn

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
2024-12-12 06:18:41 +00:00
79cf8fa751 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`. The issue happens for inputs like `-10000.1`, which make `x + sqrt(1 + x**2)` close to 0, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-12 05:40:48 +00:00
1e2b841675 [ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)
Remove gfx900 and gfx906 archs as they're long in the tooth. Should help reduce the increasing size of ROCm binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142827
Approved by: https://github.com/jeffdaily
2024-12-12 05:33:40 +00:00
cyy
fda43c98d1 Improve implementation of quantized_batch_norm (#141570)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141570
Approved by: https://github.com/albanD
2024-12-12 04:35:00 +00:00
cyy
20df80a669 Remove unneeded optional dereference (#141578)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141578
Approved by: https://github.com/swolchok
2024-12-12 04:34:43 +00:00
cyy
f7b9533c3f [4/N] Apply bugprone-unchecked-optional-access (#142832)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142832
Approved by: https://github.com/albanD
2024-12-12 04:33:32 +00:00
fbbafd0320 Turn on AOTAutogradCache by default on open source (#141981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141981
Approved by: https://github.com/bdhirsh, https://github.com/oulgen
2024-12-12 04:21:11 +00:00
4d0775462e E2E composability testing (#141398)
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` in test_pp_composability.
Currently provides @parametrize on
"ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
"MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]

Future work:
1. add fp8
2. add cp (context parallelism) to enable a 4D test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 04:19:29 +00:00
cyy
2903cf0ad8 Re-enable some C++ warnings (#142332)
This enables some C++ warnings since the code base is fairly clean. Meanwhile, Wextra-semi is disabled on CUDA-generated code since there is no way to fix those warnings without cooperation from the CUDA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332
Approved by: https://github.com/albanD, https://github.com/eqy
2024-12-12 04:02:12 +00:00
f892f9862a [ROCM] Enable *_load_dwordx4 ISA for BFloat16 and Half. (#141397)
Remove the input_vec_size constexpr and move it to a template parameter. This enables generation of vectorized loads in the ROCm AMDGPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141397
Approved by: https://github.com/jeffdaily

Co-authored-by: Jerry Mannil <jerry.mannil@amd.com>
2024-12-12 03:27:49 +00:00
4d8357e912 [CD] Use Anaconda cmake for Mac builds (#143054)
To find Anaconda-env-installed OpenMP
(as OpenMP from PyPI looks for it in different places)

For posterity: our build script names are very confusing:
 - [`.ci/wheel/build_wheel.sh`](6cb6e8d790/.ci/wheel/build_wheel.sh) is only used for MacOS wheel/libtorch builds
 - [`.ci/manywheel/build.sh`](6cb6e8d790/.ci/manywheel/build.sh) are used for Linux wheel/libtorch builds
 - [`.ci/pytorch/windows/build_pytorch.bat`](6cb6e8d790/.ci/pytorch/windows/build_pytorch.bat) is used for Windows wheel builds

Fixes https://github.com/pytorch/pytorch/issues/142873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143054
Approved by: https://github.com/Jack-Khuu, https://github.com/atalman
2024-12-12 03:05:46 +00:00
cb354f8b47 [PGNCCL] Move NCCLComm impl to cpp (#142826)
BE as titled. No behavior change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142826
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-12-12 02:45:52 +00:00
06075d3d18 [Inductor][CPP] Fix Mask Dtype mismatch (#142103)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type isn't aligned when doing `bitwise_and`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103
Approved by: https://github.com/jgong5
2024-12-12 01:21:32 +00:00
d68403df3b filelock: Make waitcounter variant to use (#139816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816
Approved by: https://github.com/ezyang
2024-12-12 01:18:34 +00:00
6cb6e8d790 Python 3.11, 3.12 Remove tests covered by 3.13 (#143078)
We do have linux-focal-py3_13-clang10-build and test. Hence removing linux-focal-py3_11-clang10-build/test and linux-focal-py3_12-clang10-build/test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143078
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:12:00 +00:00
1dd6f21029 Cuda 12.1 - Remove from trunk tests (#143076)
Remove cuda 12.1 from trunk tests. This is covered by 12.4 tests.
Move ``libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build`` -> ``libtorch-linux-focal-cuda12_4-py3_10-gcc9-debug-build``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143076
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:10:09 +00:00
bd7d81db9e Use validate-docker-images workflow from test-infra (#143081)
After PR: https://github.com/pytorch/test-infra/pull/6029 use validate-docker-images.yml from test-infra.
Related to: https://github.com/pytorch/builder/issues/2054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143081
Approved by: https://github.com/huydhn
2024-12-12 00:24:27 +00:00
cyy
db81a3f31c [TorchGen] remove remove_non_owning_ref_types from valuetype_type (#142449)
It is not used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142449
Approved by: https://github.com/ezyang
2024-12-12 00:15:44 +00:00
1b3f8b7589 Revert "[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)"
This reverts commit 209119424922b135fef39aba1f25da3b67f5879a.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:18 +00:00
dfe5669076 Revert "[RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)"
This reverts commit 734bb01460d59e661e9114e7aa17e04821e4b57a.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:17 +00:00
cd50bd8477 Revert "[BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)"
This reverts commit fb02b40d27737213e0547dec0e30977dfc50f2f3.

Reverted https://github.com/pytorch/pytorch/pull/140542 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert this in order to revert https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537204202 due to a conflict ([comment](https://github.com/pytorch/pytorch/pull/140542#issuecomment-2537253665))
2024-12-11 21:44:23 +00:00
de313f1155 [foreach_map] Initial foreach map HOP impl for inference (#142098)
This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops.

This PR utilizes the PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` takes a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (i.e. it iterates over lists of operands, applying the op elementwise). The higher order op is passed through the stack down to inductor, where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interface for those tests and then running all of the existing foreach tests on foreach_map.
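A hedged usage sketch; the import location and exact calling convention are assumptions based on the description above, not code from this PR:

```python
import torch
# Assumption: this is where the user-facing helper lives.
from torch._higher_order_ops.foreach_map import foreach_map

xs = [torch.randn(8) for _ in range(3)]
ys = [torch.randn(8) for _ in range(3)]

@torch.compile
def fn(xs, ys):
    # The single pointwise op is applied elementwise across the operand lists,
    # mirroring torch._foreach_add semantics.
    return foreach_map(torch.add, xs, ys)

out = fn(xs, ys)
```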

TODO before landing:
* Add tests for general functions
* Test warning if unsupported op will block fusion

Followups:
* I need to add tests for backwards (this will be a followup PR because backwards will  require other work as well)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098
Approved by: https://github.com/eellison
2024-12-11 21:32:11 +00:00
bd199bc754 [EZ] Move slow job from CU12.1 to CU12.4 (#142856)
I thought it was migrated a while back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142856
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi
2024-12-11 21:12:35 +00:00
688f44824b DistributedDataParallel: add init_sync option to control collectives during initialization (#142824)
This controls whether or not we run collectives during the DDP init function. This makes it easier to use fault-tolerant ProcessGroup implementations that may not all start at the same time.

torchft uses a dummy process group and a comm hook to get around these checks. With this change torchft can use the normal ProcessGroup API via the stock comm hook.

https://github.com/pytorch-labs/torchft/blob/main/torchft/ddp.py#L50-L59
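Illustrative usage, assuming the new flag is exposed as a DDP constructor keyword (a sketch, not the PR's test):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the default process group has already been initialized on each rank.
model = nn.Linear(8, 8)
# init_sync=False skips the collectives normally run at construction time, so
# ranks that start at different times don't block each other during init.
ddp_model = DDP(model, init_sync=False)
```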

Test plan:

```
pytest test/distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142824
Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/H-Huang
2024-12-11 20:28:38 +00:00
fd65bd755d [BE] replace incorrect .. note:: invocations (#142868)
Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868
Approved by: https://github.com/albanD
2024-12-11 19:58:18 +00:00
0b96413dbf Upgrade expecttest to 0.3.0 (#142869)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142869
Approved by: https://github.com/albanD, https://github.com/malfet
2024-12-11 19:04:16 +00:00
cyy
e5f08c0cbf [TorchGen] Remove cpp_type_registration_declarations (#142452)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142452
Approved by: https://github.com/ezyang
2024-12-11 19:01:36 +00:00
cyy
e228381846 [TorchGen] Simplify argument_type_str (#142491)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142491
Approved by: https://github.com/ezyang
2024-12-11 19:01:20 +00:00
42d4eec5f3 Don't install lintrunner on S390 (#142876)
Not sure if there are many users of this platform, but hopefully this will fix https://github.com/pytorch/pytorch/issues/142872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142876
Approved by: https://github.com/jeanschmidt
2024-12-11 18:54:12 +00:00
e647b6d590 Fix undesired specialization on slice after split. (#142372)
Fix: #141251

This PR adds a few static guard checks when decomposing and lowering the `slice`
operation, so that we avoid adding unnecessary guards. Specifically, when clamping the end
values.

In summary, the changes are:

- `slice` dynamo decomposition: checks `end >= sizes[dim]` statically. If we don't know
  that, the following guard ensures that we (don't) need clamping.
- `evaluate_min` inductor `sizevar` function: checks whether we can solve it statically or
  not, before actually creating a new guard.

The latter had to be changed because `evaluate_min` (called by the `ir.SliceView` constructor)
would always try to create a guard based on the result of the operation on the hints. However,
if both the `left` and `right` hints were true, it would default to the `left <= right` guard.
By checking the guards statically beforehand, we can avoid that.

```python
N = 16

@torch.compile(backend="inductor", dynamic=False, fullgraph=True)
def fn(x):
    splits = torch.ops.aten.split.Tensor(x, N)
    first = splits[0]
    return torch.ops.aten.slice.Tensor(first, 0, 0, N)

x = torch.arange(N)
torch._dynamo.mark_dynamic(x, 0)

fn(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142372
Approved by: https://github.com/ezyang
2024-12-11 18:52:17 +00:00
0ddb33ba22 [ONNX] Avoid overwriting overlapped decomposed functions (#142831)
Fixes #141770

The decomposed functions in `torch.export.default_decompositions().items()` are overwritten by `torch._decomp.decomposition_table`. From `torch.onnx.export()`'s perspective, we should rather respect the table of decompositions in `torch.export.default_decompositions().items()` and avoid overwriting it with `torch._decomp.decomposition_table`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142831
Approved by: https://github.com/justinchuby
2024-12-11 18:47:40 +00:00
c632e29774 [hop][dynamo] support torch.SymInt inputs (#141524)
Fixes https://github.com/pytorch/pytorch/issues/141305.

```python
        class M(torch.nn.Module):
            def forward(self, x, y, z):
                a = y.shape[0]
                b = z.shape[0]

                def true_fn(x):
                    return x + a

                def false_fn(x):
                    return x + b * z

                # When exporting with non-strict: a and b are symints,
                # so torch.compile need to wrap and trace symint inputs.
                return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```

In non-strict export, when inputs are annotated with dynamic shapes, the a and b in the above example are of torch.SymInt type, so true_fn and false_fn will have closures of torch.SymInt type. The error is triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how we handled SymBool inputs previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #142185
2024-12-11 18:46:58 +00:00
a8fa98ccef skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of tests in test_aotdispatch.py are not meaningful (from a user's perspective) when we run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
2024-12-11 18:46:58 +00:00
cyy
24a5a2ef25 [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang
2024-12-11 18:40:42 +00:00
be27dbf2b8 Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088)
Getting tested with ao, but now there is a real test i added.

## What does this PR do?

We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, achieve python agnosticism. It turns out that there is such a way that setuptools/Python provides already! Namely, if the user promises to use only the Python limited API in their extension, they can pass in `py_limited_api` to their Extension class and to the bdist_wheel command (with a min python version) in order to build 1 wheel that will suffice across multiple Python versions.

Sounds lovely! Why don't people do that already with PyTorch? Well 2 things. This workflow is hardly documented (even searching for python agnostic specifically does not reveal many answers) so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work because we always link torch_python, which does not abide by py_limited_api rules.

So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described.
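A rough sketch of what such a build setup could look like; the project layout, names, and exact options are illustrative assumptions, not code from this PR:

```python
# setup.py -- hypothetical extension "my_ext" with a single C++ source file.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",
    ext_modules=[
        CppExtension(
            "my_ext._C",
            ["csrc/ops.cpp"],
            py_limited_api=True,  # promise to only use the Python limited API
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    # Tag the wheel as abi3 so one artifact covers Python >= 3.9.
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```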

## How do I know this PR works?

I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that
- torch_python doesn't show up in the ldd tree
- no Py- symbols show up
It may be a little confusing that our test case is actually python-free (cleaner than python-agnostic), but it is sufficient (though not necessary) to show that this change works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-12-11 18:22:55 +00:00
fb02b40d27 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-11 17:57:56 +00:00
cyy
82aaf64422 [3/N] Apply py39 ruff fixes (#142115)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142115
Approved by: https://github.com/ezyang
2024-12-11 17:50:10 +00:00
f7e621c3ce [ROCm] TunableOp do not log during exit (#142818)
Depending on the order of static object destruction, the TunableOp logger may not be available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142818
Approved by: https://github.com/jeffdaily
2024-12-11 17:44:29 +00:00
233853a66f Revert "Prologue Fusion (#134532)"
This reverts commit 59ab3825e77451b29c3a118fd24304afcbf52c09.

Reverted https://github.com/pytorch/pytorch/pull/134532 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
f0b80d014d Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit 1fb3d5a4e35d4ea5691287d4ce77da40578bda4a.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
829a93562a Revert "[Easy] factor out inductor ophandler decompositions (#142400)"
This reverts commit fa746e3eeb8e1cdcbe3f47ded9e3ca30efac383c.

Reverted https://github.com/pytorch/pytorch/pull/142400 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
9e88279737 Revert "Add a pass which analyzes whether a prologue preserves zero mask (#142401)"
This reverts commit 1a0bd402436af4c127817a31c76d7ae47d4668b2.

Reverted https://github.com/pytorch/pytorch/pull/142401 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
b118702a4e Revert "Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)"
This reverts commit f2d8d7b7acf12f079cadc41b9fdd91cbae94daac.

Reverted https://github.com/pytorch/pytorch/pull/142402 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
2dcba6eac8 Revert "dont attempt to fuse in unaligned accesses to mm (#142435)"
This reverts commit 22683195964398b37ba0d539cb1bb55bff197db6.

Reverted https://github.com/pytorch/pytorch/pull/142435 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
5c97ac9721 Revert "Remove unused Python variables in torch/[_-a]* (#133492)"
This reverts commit fda975a7b3071a20dab8fc2c4e453479e1bb7cf2.

Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))
2024-12-11 17:29:12 +00:00
db51308d9c fix output node name (#142506)
Fixes #142227

Differential Revision: [D67043283](https://our.internmc.facebook.com/intern/diff/D67043283/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142506
Approved by: https://github.com/ydwu4
2024-12-11 17:28:28 +00:00
2374d460d0 Revert "filelock: Make waitcounter variant to use (#139816)"
This reverts commit 237c4b559c0f928dd89cf1e773458a1bdcea0b9d.

Reverted https://github.com/pytorch/pytorch/pull/139816 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/139816#issuecomment-2536616808))
2024-12-11 17:26:46 +00:00
498a7808ff Fix unused Python variables outside torch/ and test/ (#136359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
241bf047b3 [Dynamo] Skip some unresolvable tests (#142508)
Fixes #127738
Fixes #127755

In the discussion in https://github.com/pytorch/pytorch/issues/127738 we
determined that this is not fixable, so we're just going to skip the
test.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142508
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #142502, #142503
2024-12-11 17:00:23 +00:00
00ac4237b2 [Dynamo] stop import third-party astunparse (#142503)
PyTorch's minimum Python version is 3.9, so we can now use ast.unparse.
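For reference, the stdlib replacement in action:

```python
import ast

tree = ast.parse("x = 1 + 2")
print(ast.unparse(tree))  # x = 1 + 2
```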

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142503
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #142502
2024-12-11 17:00:23 +00:00
0268abd627 [Dynamo] Stop importing transformers (#142502)
This import was free because transformers should already have been
imported by this time.

Test Plan:
- CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142502
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
2024-12-11 17:00:22 +00:00
8fd4b26504 Revert "[dynamo] Support multiple inheritance for custom dict construction (#142416)"
This reverts commit a45326b6497e47d01527e141cdd16d91fee94c18.

Reverted https://github.com/pytorch/pytorch/pull/142416 on behalf of https://github.com/clee2000 due to The newly added test is faling internally D67056273 ([comment](https://github.com/pytorch/pytorch/pull/142416#issuecomment-2536537693))
2024-12-11 16:56:26 +00:00
c3b30c283f add additional CK BMM instances (#142409)
Summary: stacked changes to keep new codegen-ed instances below 2000 LOC

Reviewed By: zjing14

Differential Revision: D66738746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142409
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-11 16:54:05 +00:00
d622040ab1 [AOTI] Unskipped test_scaled_dot_product_efficient_attention for ROCm (#142138)
The test should no longer fail for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142138
Approved by: https://github.com/janeyx99
2024-12-11 16:36:04 +00:00
d5e00412c7 sparse_broadcast_to: less memory footprint, fewer kernel launches (#142364)
As per title.

The following implementation removes the usage of `repeat_interleave`, `tile`, and `full_coo_indices` and replaces them with broadcasting. That way we reduce memory traffic (and are likely to hit the cache a lot) and the total number of launched kernels.
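Not the PR's code, but a toy illustration of the idea: a broadcasted view yields the same values without materializing the repeated indices.

```python
import torch

row = torch.arange(3).unsqueeze(0)   # shape (1, 3)
n = 4
materialized = row.tile((n, 1))      # allocates an (n, 3) copy
broadcasted = row.expand(n, -1)      # a view: same values, no extra storage
assert torch.equal(materialized, broadcasted)
```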

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142364
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-12-11 16:09:09 +00:00
eed9bb3a0e allow -E to be in any spot in the compiler command (#142813)
Follow up of TODO in https://github.com/pytorch/pytorch/pull/140614

It was found experimentally that, for one GPU architecture, `sccache` passes `-E` as the 1st, 2nd, or 3rd argument, but it's much more robust to handle the case where `-E` is passed as any argument

No need to worry about exit or elif chains, as `exec` aborts script execution

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142813
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2024-12-11 15:58:08 +00:00
ee817e8cf3 [ROCm] Second attempt to fix unit test: matmul_small_brute_force_tunableop (#142422)
Fixes #141458
Fixes #141635
Fixes #141636

~~Address OOM issue by clearing PyTorch's caching allocator.~~

Disabling this test on NVIDIA since it doesn't do much on NVIDIA hardware at the moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142422
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-11 15:36:37 +00:00
371bcc1e33 [checkpointing][oss] Throw an error when loading a different size than saved tensor (#141571)
Summary: Fixing issue reported in https://github.com/pytorch/pytorch/issues/126604

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner -- --exact 'caffe2/test/distributed/checkpoint:test_planner - test_planner.TestLoadPlanner: test_strict

Differential Revision: D66389578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141571
Approved by: https://github.com/mhorowitz
2024-12-11 15:35:48 +00:00
bacd68107a [inductor] Parenthesize expression in _helper_sqrt (#142352)
Fixes https://github.com/pytorch/pytorch/issues/142328. The implied cast-then-sqrt order matches the behavior of the `halide` backend.

2cc01cc6d3/torch/_inductor/codegen/halide.py (L115-L116)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142352
Approved by: https://github.com/ezyang
2024-12-11 15:30:52 +00:00
86300965b6 Add automatic_dynamic_shapes_mark_as == "oblivious" (#141444)
Fixes https://github.com/pytorch/pytorch/issues/137100

Should also add a mark_oblivious API for manual control.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141444
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #141415
2024-12-11 14:39:13 +00:00
e53696bfdb automatic_dynamic_shapes_mark_as (#141415)
This adds an option to cause automatic dynamic shapes to trigger
unbacked SymInts rather than backed SymInts.  This can potentially
help if you are still seeing recompilations from 0/1 specialization
but it also might just cause your program to fail with
GuardOnDataDependent errors.
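If the knob is exposed on the dynamo config (an assumption about the exact location and accepted values), usage might look like:

```python
import torch

# Assumption: the option name matches the PR title and "unbacked" is the value
# that triggers unbacked SymInts on automatic-dynamic recompiles.
torch._dynamo.config.automatic_dynamic_shapes_mark_as = "unbacked"
```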

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141415
Approved by: https://github.com/bobrenjc93
2024-12-11 14:39:13 +00:00
b576a8c318 Tests Generalization for multiple accelerator devices (#139184)
Motivation: Generalize unit tests so that they can be executed for CUDA and non-CUDA devices.
Dependency: #133209, merged now.
There was a #135242 for these changes, which was closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501  @zeshengzong Please review the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-12-11 13:31:20 +00:00
539c46b6e8 [Dynamo] Add register_hook as in-graph tensor method (#142820)
Fixes https://github.com/pytorch/pytorch/issues/141046

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142820
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang
2024-12-11 12:02:03 +00:00
c29b4edbb9 Remove no-op aot_compilation_time (#142490)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142490
Approved by: https://github.com/xuzhao9
2024-12-11 10:37:25 +00:00
30d8b30db7 refactor tensorify restart logic to use sources (#141517)
Differential Revision: [D67066706](https://our.internmc.facebook.com/intern/diff/D67066706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141517
Approved by: https://github.com/ezyang
2024-12-11 07:15:39 +00:00
bdbdbeeb3d Implements nonzero_static on cuda (#141838)
using blockwide cub primitives.
This adds CUDA functionality for nonzero_static, which was missing in https://github.com/pytorch/pytorch/pull/97417.

For `size` approximately equal to the number of nonzeros, the perf is very close to the regular version; for larger sizes, filling in the padding indices takes additional time.
Disabled for cuda <=11.4
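A small usage sketch; the keyword names follow the existing CPU operator and should be treated as assumptions:

```python
import torch

x = torch.tensor([0, 3, 0, 5], device="cuda")
# Always returns `size` rows; slots beyond the actual nonzeros are padded with
# `fill_value`, so the output shape is static regardless of the data.
idx = torch.nonzero_static(x, size=4, fill_value=-1)
print(idx)  # e.g. [[1], [3], [-1], [-1]]
```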

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141838
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-11 06:44:48 +00:00
1d3b0108a6 [subclass] Fix unwrap subclass parametrization for Nested subclasses (#142481)
@tugsbayasgalan found a bug for nested subclasses:

E.g. we have

TwoTensor(TwoTensor(t1, t2), t0).
After right_inverse we have:

rebuilt_stack == [(TwoTensor, meta, ["a", "b"]), (TwoTensor, meta, ["a", "b"])]
plain_tensors == [t0, t1, t2]
We will first put plain tensors, and only then the nested TwoTensor.

But when we unflatten:
todo = [t0, t1, t2]
we first create TwoTensor(t1, t2),
push it onto todo: [t0, TwoTensor(t1, t2)],
and as a result get

 TwoTensor(t0, TwoTensor(t1, t2))
which swaps the original a and b :)

So the fix should be different: we need to preserve the order of elements in the stack for plain tensors and subclasses.
I will think about the fix.

Fix:

Keep the order of inner_tensor_attr_names according to how they were added to the stack (first plain tensor attributes, then subclass attributes).

Test:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```

Differential Revision: [D67032477](https://our.internmc.facebook.com/intern/diff/D67032477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142481
Approved by: https://github.com/tugsbayasgalan, https://github.com/bdhirsh
2024-12-11 06:05:48 +00:00
7e92b02e09 add test for module list slice (#142520)
Nothing to fix for #142439

Differential Revision: [D67049962](https://our.internmc.facebook.com/intern/diff/D67049962/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142520
Approved by: https://github.com/ydwu4
2024-12-11 05:11:00 +00:00
256bfd1096 Rename 'cache limit' to 'recompile limit' (#141542)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141542
Approved by: https://github.com/oulgen, https://github.com/jansel
2024-12-11 05:05:11 +00:00
84cf94ee0b Make more of the reshape_symint stride calculation oblivious (#142488)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142488
Approved by: https://github.com/albanD
2024-12-11 05:04:42 +00:00
921ba0a75e Mark torch._library.custom_ops / torch._dynamo.eval_frame as uninteresting (#142492)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142492
Approved by: https://github.com/bobrenjc93
2024-12-11 05:03:30 +00:00
47a571e166 Document that load_inline requires having a compiler installed (#137521)
Prompted by this forum q: https://discuss.pytorch.org/t/are-the-requirements-for-using-torch-utils-cpp-extension-with-cuda-documented-anywhere/211222

Would be curious to know if we could get more precise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137521
Approved by: https://github.com/zou3519
2024-12-11 03:47:54 +00:00
21833c9642 Added Differentiable per_sample_weights Check to EmbeddingBag.cpp (#142338)
Added a check in aten/src/ATen/native/EmbeddingBag.cpp for whether per_sample_weights needs a gradient, in order to determine whether at::_embedding_bag_forward_only or at::_embedding_bag should run.
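A hedged sketch of the behavior being exercised (not the PR's test code):

```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3, requires_grad=True)
input = torch.tensor([1, 2, 4, 5])
offsets = torch.tensor([0, 2])
# per_sample_weights that require grad should route to at::_embedding_bag
# (not the forward-only variant), so backward populates psw.grad.
psw = torch.rand(4, requires_grad=True)
out = F.embedding_bag(input, weight, offsets, mode="sum", per_sample_weights=psw)
out.sum().backward()
print(psw.grad)
```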

Also, added two tests in test_embedding.py that verify this now works.

Fixes #136457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142338
Approved by: https://github.com/soulitzer
2024-12-11 03:42:17 +00:00
92cc345683 Implement "torch.mtia.max_memory_allocated" API (#142406)
Summary: This diff implements the interface of the "torch.mtia.max_memory_allocated" API. The internal implementation will be addressed in a separate diff.

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
----------------------------------------------------------------------
Ran 15 tests in 16.862s

OK
I1127 11:31:14.613909 2272144 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-uqJKuNc0 executable has been unloaded
I1127 11:31:14.615438 2272144 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
```

Reviewed By: ttrung149, nautsimon

Differential Revision: D66553954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142406
Approved by: https://github.com/nautsimon
2024-12-11 03:06:18 +00:00
ed388394d1 add torchrec collectives to enforce global ordering (#141970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141970
Approved by: https://github.com/yf225
2024-12-11 02:45:24 +00:00
082124a322 [Dynamo] Refactor to use install subgraph method in higher order ops (#141384)
Replaced the function in HOP infra with a method on output graph to make it more general and accessible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141384
Approved by: https://github.com/zou3519
ghstack dependencies: #141381, #141382, #141383
2024-12-11 02:22:21 +00:00
c31543c7ae [Dynamo] Initial deduplication pass impl (#141383)
This PR implements the deduplication pass (blocked by config currently) for dynamo where identical regions from https://github.com/pytorch/pytorch/pull/141381 are replaced with a common subgraph.

The two phases of deduplication are explained below.

**Subgraph creation**:
Subgraph creation works by taking one representative region from each region group and creating a subgraph from it, which will then be used to replace all regions in the group. This is implemented by first copying all nodes of the region to the new subgraph and then finding all inputs which are not within the region and creating placeholders for them. For the outputs, all regions in a region group need to be scanned to ensure the largest set of outputs is found, and then an output node is created which returns a tuple of all outputs.

**Graph replacement**:
To replace each region with the extracted subgraph, the node index in the region and argument index within the node's flattened args and kwargs are recorded once during subgraph creation. This allows us to determine which (external to the region) nodes and in which order these nodes are passed as inputs. For the outputs, getitem nodes are created for each output, and all nodes in the region with external outputs are replaced by the proper getitem node. Finally, all original nodes are erased (there should be no uses of these left in the graph).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141383
Approved by: https://github.com/zou3519
ghstack dependencies: #141381, #141382
2024-12-11 02:22:21 +00:00
49e4307686 [Dynamo] add debug logging for graph region expansion (#141382)
This PR adds debug logging for the region expansion algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141382
Approved by: https://github.com/williamwen42
ghstack dependencies: #141381
2024-12-11 02:22:21 +00:00
96c36a6947 [Dynamo] Implement graph region tracking for deduplication (#141381)
This PR implements graph region tracking for later extraction into common subgraphs. The algorithm is as follows:

`GraphRegionTracker` tracks each node added to the output graph and generates a key based on the source location, instruction pointer, input shapes, and global state at the time the node is inserted into the graph. Nodes with the same key are grouped together in a list of identical nodes.

Once graph capture is complete, these nodes are organized into region groups. A region group looks like this:
[[IdenticalNode1], [IdenticalNode2], [IdenticalNode3]] and each sublist is called a region. For each region group (starting at the topologically latest region group), the inner regions are gradually expanded one node at a time from the args and kwargs of the node in each region, provided that for all regions in the group, the nodes being added are also identical (i.e. have the same key computed above). The `get_identical_regions` function is the main entry point which will be used by the graph replacement algorithm in #141383

Edge cases to add more testing for in future PRs (in progress):
* ~~multiple nodes on the same line~~ (implemented)
* ~~dynamic shapes checking (need to verify symbolic inputs are the same across subgraphs)~~ (implemented)
* ensure we don't expand regions where it will create a cycle during subgraph replacement
* ensure outputs are always tensors (or tuples of tensors iirc)
* ~~out of order kwargs, unevenly nested kwargs~~ (implemented)
* input aliasing - TBD, we may add support for this in `invoke_subgraph` or reuse the aliasing analysis here to not form regions with these properties
* ~~all global state~~ (implemented)

Other followups:
* consolidate global state checking across all caching infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141381
Approved by: https://github.com/zou3519
2024-12-11 02:22:21 +00:00
734bb01460 [RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)
# Motivation
This PR intends to add C++ accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is f84e533a2c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142468, #133572
2024-12-11 02:04:52 +00:00
2091194249 [RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)
# Motivation
This PR intends to add UTs for accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is 952514f0c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #142468
2024-12-11 02:04:52 +00:00
88154024b3 [pipelining] Add ZBV schedule (#142084)
Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444

cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084
Approved by: https://github.com/kwen2501
2024-12-11 02:00:57 +00:00
95b17f6346 [MPS] Add CompileShader method (#141478)
This allows one to do something like that
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
")
m.foo(x)
```

And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-11 02:00:51 +00:00
d2e5e5b1a5 [AOTI] Remove redundant AOTI_TORCH_EXPORT (#142500)
Summary: Remove redundant AOTI_TORCH_EXPORT from shim_common.cpp since these functions are already declared with AOTI_TORCH_EXPORT in the corresponding header file. This is to solve the issue in https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716

Differential Revision: D67031626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142500
Approved by: https://github.com/frank-wei
2024-12-11 01:59:37 +00:00
cyy
7d98b3dcee [3/N] Apply bugprone-unchecked-optional-access (#142442)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142442
Approved by: https://github.com/albanD
2024-12-11 01:39:10 +00:00
2b105de2c1 [Monitor] Enable non-perf linux test monitor (#142168)
# Overview
Enable monitorings for non-perf linux tests

# Other
- move the monitoring step right before the build artifact step for mac_test.yml; note that monitoring is not enabled for this test yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142168
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-12-11 01:10:43 +00:00
393cf46f42 Revert "[MPS] Add CompileShader method (#141478)"
This reverts commit 0478fee42db16a0477add1d0a644ce713f31a875.

Reverted https://github.com/pytorch/pytorch/pull/141478 on behalf of https://github.com/malfet due to Broke doctests, by trying to run MPS example on Linux ([comment](https://github.com/pytorch/pytorch/pull/141478#issuecomment-2533351909))
2024-12-11 00:37:10 +00:00
b94a206414 [CI] Use sccache-0.8.2 for CUDA builds (#140614)
Instead of an ancient prebuilt binary

This is a followup from https://github.com/pytorch/pytorch/pull/121323
For some reason, newer `sccache` does not work when `gcc` is invoked with the `-E` option, so one has to special-case `-E` in the `/opt/ccache/bin/gcc` wrapper, which also had to be special-cased to work with `nvcc` by checking whether `-E` is passed not only as the first or second, but as the 3rd argument as well (to be followed up by a generic https://github.com/pytorch/pytorch/pull/142813 ), i.e. to generate the following wrapper:
```shell
#!/bin/sh

if [ "$1" = "-E" ] || [ "$2" = "-E" ] || [ "$3" = "-E" ]; then
  exec /usr/bin/gcc "$@"
elif [ $(env -u LD_PRELOAD ps -p $PPID -o comm=) != sccache ]; then
  exec sccache /usr/bin/gcc "$@"
else
  exec /usr/bin/gcc "$@"
fi
```

Without it, `sccache nvcc hello.cu` failed with a non-descriptive
```
    sccache: error: failed to execute compile
    sccache: caused by: Compiler not supported: ""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140614
Approved by: https://github.com/wdvr

Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
2024-12-11 00:34:38 +00:00
ea152d2472 [be] better error message for flight recorder status (#142505)
Summary:
Change the log back to VLOG(2) in waitForFutureOrTimeout. Instead, print a more user-friendly message if FR completes successfully.
This message is meant for developers only, so don't default to `INFO` in this function.

Also, change one more message from LOG(ERROR) to LOG(INFO).

Tested locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142505
Approved by: https://github.com/kwen2501
2024-12-11 00:10:59 +00:00
eqy
bd4071f0b0 [Matmul][CUDA][FP8] Skip rowwise scaling tests on non-sm90 (#141596)
Since the current kernel is using sm90-specific features, just pre-emptively skip the test for any non-sm90 compute capabilities

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141596
Approved by: https://github.com/drisspg
2024-12-10 23:16:19 +00:00
4a16a60052 [C10D] Add better profiling title for NCCL barrier, nccl:all_reduce to nccl:all_reduce_barrier (#140785)
Fixes [issue](https://github.com/pytorch/pytorch/issues/140257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140785
Approved by: https://github.com/wconstab
2024-12-10 23:08:15 +00:00
237c4b559c filelock: Make waitcounter variant to use (#139816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816
Approved by: https://github.com/ezyang
2024-12-10 23:02:59 +00:00
e36fbbf826 Fix ARM bfloat16 fmsub & improve vec_test_all_types coverage (#142499)
This function was very broken and untested. Now it is tested, and vec_test_all_types is passing internally as well.

Differential Revision: [D67036894](https://our.internmc.facebook.com/intern/diff/D67036894/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142499
Approved by: https://github.com/malfet
2024-12-10 22:51:41 +00:00
0478fee42d [MPS] Add CompileShader method (#141478)
This allows one to do something like that
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
")
m.foo(x)
```

And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-10 22:43:17 +00:00
95e7fcf82e inductor: remove duplicate triton configs for autotuning (#142254)
Summary:
# Why

- sampling the same config multiple times is wasteful, especially on exhaustive
- for AMD we rewrite the configs to have a specific number of stages, which might lead to some configs appearing multiple times

# What

cast the configs, already defined as a tuple, through a set to remove duplicates

Test Plan:
taken from the `mm_kernel_configs` logic in the same file
```
>>> mm_kernel_configs = [        {"config": (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps), "cond": True}        for BLOCK_M, BLOCK_N, BLOCK_K in itertools.product(            [16, 32, 64, 128, 256], repeat=3        )        for num_stages in [1, 2, 3, 4, 5]        for num_warps in [2, 4, 8]    ]
>>> configs = [c['config'] for c in mm_kernel_configs]
>>> a = tuple((c[0], c[1], c[2], 0, c[4]) for c in configs)
>>> len(set(a))
375
>>> len(a)
1875
>>>
```

Differential Revision: D66893774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142254
Approved by: https://github.com/henrylhtsang
2024-12-10 22:19:06 +00:00
da29c13693 [while_loop] data-dependent op in body_fn (#142031)
The idea is that the parent hop's fake tensor mode should ignore the newly allocated unbacked symints in the subgraph, because the bindings of unbacked symbols in the subgraph should already be done when we trace the subgraph. E.g. if there's an operator in the subgraph that produces unbacked symints, the track_tensor_tree logic for that operator will take care of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142031
Approved by: https://github.com/zou3519
ghstack dependencies: #142162
2024-12-10 21:54:28 +00:00
7111cd6ee0 [hop][BE] add util diff_meta with prettier error message. (#142162)
The error message changes from:
```python
-torch._dynamo.exc.Unsupported: Expected branches to return tensors with same metadata. [(tensor_pair, difference)...]:[('pair0:', TensorMetadata(shape=torch.Size([4, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}), TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}))]
```
to
```python
+torch._dynamo.exc.Unsupported: Expect branches to return tensors with same metadata but find pair[0] differ in 'shape', where lhs is TensorMetadata(shape=torch.Size([4, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}) and rhs is TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142162
Approved by: https://github.com/zou3519
2024-12-10 21:54:28 +00:00
9ced54a51a [hop] lift free symbols in slice (#142385)
Before the change, we get an unfound proxy error when linting the subgraph.

After the change, we have the following dynamo graph for dynamic_shape test.

```python
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]  /data/users/yidi/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]     def forward(self, s0: "Sym(s0)", s1: "Sym(s1)", s2: "Sym(s2)", L_x_: "f32[s0, s1, s2][s1*s2, s2, 1]cpu"):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         l_x_ = L_x_
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:307 in f, code: i = x.size(0) - 2
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         sub: "Sym(s0 - 2)" = s0 - 2
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:308 in f, code: j = x.size(1) - 3
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         sub_1: "Sym(s1 - 3)" = s1 - 3
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:310 in f, code: return wrap(lambda x: x[:i, :j, k:], x)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         wrap_body_0 = self.wrap_body_0
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         wrap = torch.ops.higher_order.wrap(wrap_body_0, s0, s1, s2, l_x_, sub, sub_1);  wrap_body_0 = s0 = s1 = s2 = l_x_ = sub = sub_1 = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         getitem: "f32[s0 - 2, s1 - 3, 0][s1*s2, s2, 1]cpu" = wrap[0];  wrap = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         return (getitem,)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]     class wrap_body_0(torch.nn.Module):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         def forward(self, s0: "Sym(s0)", s1: "Sym(s1)", s2: "Sym(s2)", l_x_: "f32[s0, s1, s2][s1*s2, s2, 1]cpu", sub: "Sym(s0 - 2)", sub_1: "Sym(s1 - 3)"):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]              # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:310 in <lambda>, code: return wrap(lambda x: x[:i, :j, k:], x)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]             getitem: "f32[s0 - 2, s1 - 3, 0][s1*s2, s2, 1]cpu" = l_x_[(slice(None, sub, None), slice(None, sub_1, None), slice(s2, None, None))];  l_x_ = sub = sub_1 = s2 = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]             return (getitem,)
```

We lift sub and sub_1 because they're compound expressions and are directly used as arguments of the getitem node. We lift s0, s1, and s2 because they're basic symbols in the tensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142385
Approved by: https://github.com/zou3519
2024-12-10 21:52:30 +00:00
795ff0e9f7 [ROCm] Improve reduce sum calculation for low CU count (#141378)
Improve reduce sum calculation for low CU count by enabling splitting the rows across warps for some 2D tensor shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141378
Approved by: https://github.com/jeffdaily
2024-12-10 21:48:56 +00:00
fda975a7b3 Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-10 21:48:44 +00:00
2268319596 dont attempt to fuse in unaligned accesses to mm (#142435)
This isn't profitable - we were trying to fuse in a padding of unaligned mm, which defeats padding's purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142435
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401, #142402
2024-12-10 21:35:26 +00:00
f2d8d7b7ac Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)
For prologues which only do either loads like gathers or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 without changing numerics.

Prologues that actually do arithmetic will need to use invoke quant. But I would like to to support upcasts/gathers out of the box.

We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401
2024-12-10 21:26:03 +00:00
bee445c3a3 [MPS] Support torch.Event for MPS (#142468)
# Motivation
Support `torch.Event` on mps backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142468
Approved by: https://github.com/malfet
2024-12-10 21:17:25 +00:00
1a0bd40243 Add a pass which analyzes whether a prologue preserves zero mask (#142401)
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tl.where(a_mask, tmp1, 0.0)
```
now we do not need to ->
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tmp1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400
2024-12-10 21:16:13 +00:00
c30dd35877 [Device] Add "mps" to torch._utils._get_device_attr (#142447)
Follow up after  https://github.com/pytorch/pytorch/pull/141098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142447
Approved by: https://github.com/kit1980
2024-12-10 20:57:17 +00:00
6f8751dcc9 Fix timeout check workflow lint job (#142476)
Fixes https://github.com/pytorch/pytorch/issues/142485

The workflow check lint job timed out in trunk, i.e. https://github.com/pytorch/pytorch/actions/runs/12261226178/job/34207762939, and here was what happened:

1. https://github.com/pytorch/pytorch/pull/142294 landed yesterday to build ROCm on 3.13, but the PR had a landrace with https://github.com/pytorch/pytorch/pull/142282 in the generated workflow file
2. The trunk lint check caught that in https://github.com/pytorch/pytorch/blob/main/.github/scripts/report_git_status.sh#L2
3. However, the script also attempted to print the difference with `git diff .github/workflows`.  This command was the one that got stuck, because `git diff` uses a pager by default and requires a prompt to display the next page ¯\_(ツ)_/¯

It took so long to debug this because a timed-out Nova GHA job doesn't print any progress.  I'll create an issue for this.

Bonus:

I also fixed the broken print from the test tool lint job that confuses GitHub https://github.com/pytorch/pytorch/actions/runs/12261226178 with an annotation failure `Credentials could not be loaded, please check your action inputs`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142476
Approved by: https://github.com/wdvr
2024-12-10 20:47:22 +00:00
f57606ab85 Migrate smoke tests to pytorch/pytorch (#142482)
Related to https://github.com/pytorch/builder/issues/2054
This should fix the nightly xpu failure: https://github.com/pytorch/pytorch/actions/runs/12251477588/job/34180135207 and rocm failure: https://github.com/pytorch/pytorch/actions/runs/12251477588/job/34182185374 due to the missing ``/builder/check_binary.sh``

Builder Scripts revision: 3468139e81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142482
Approved by: https://github.com/chuanqi129, https://github.com/kit1980, https://github.com/malfet, https://github.com/jeffdaily, https://github.com/huydhn
2024-12-10 20:43:36 +00:00
117b6c3e2c [Easy][Dynamo][TVM] remove unnecessary prints (#142445)
This PR intends to remove the unnecessary prints in the auto-scheduler of dynamo's TVM backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142445
Approved by: https://github.com/jansel
2024-12-10 19:52:02 +00:00
e95bd337e1 Some workflows to use oidc instead of AWS keys (#142264)
Roles defined in https://github.com/pytorch-labs/pytorch-gha-infra/pull/563

With this, I think we can get rid of the AWS credentials in the upload-stats environment

Untestable because I can't add branches to the upload-stats environment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142264
Approved by: https://github.com/huydhn
2024-12-10 19:40:23 +00:00
3c03bc2431 [dynamo] Expand support of enum attribute access (#142268)
This patch changes `EnumVariable` to support access to all types of
attributes, not just non-callable literals.
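An illustrative sketch of the kind of access this enables (a guess at the shape of the issue, not its exact repro):

```python
import enum
import torch

class Mode(enum.Enum):
    FAST = 2
    def scale(self):          # a callable attribute on the enum member
        return self.value * 3

@torch.compile(fullgraph=True)
def fn(x):
    return x * Mode.FAST.scale()

print(fn(torch.ones(2)))      # tensor([6., 6.])
```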

Fixes #142050.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142268
Approved by: https://github.com/jansel
ghstack dependencies: #142267
2024-12-10 19:32:40 +00:00
b117945918 [dynamo] Remove dead code in ConstantVariable.const_getattr (#142267)
This path is no longer reachable after #113390, which also updated
`test_access_class_method_from_user_class` to reflect that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142267
Approved by: https://github.com/jansel
2024-12-10 19:32:40 +00:00
f74ba5d30d [dynamo] Remove special graph break for self-referential list (#142438)
We introduced a special graph break to avoid max-recursion-depth error
in #100296.

After #111415, the original `test_list_self_reference` no longer
triggers the special graph break because we started modeling root frame
free variables with `LazyVariableTracker`.

After #117426, we no longer build the list items eagerly, and they'll hit
`variable_tracker_cache` when they get lazily constructed later.

As a result, this patch updates the `test_list_self_reference` test and
removes the special graph break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142438
Approved by: https://github.com/jansel
ghstack dependencies: #142437
2024-12-10 19:23:48 +00:00
4f75f1e80d [dynamo] Use proper item source for NamedTupleVariable (#142437)
Dynamo was generating `GetItemSource(tuple_source, index)` for items of
`NamedTupleVariable`, but that stops working when a user supplied named
tuple has a custom `__getitem__` function with different semantics.

This patch
- fixes the aforementioned issue by using `AttrSource` instead.
- handles named tuple outside `wrap_listlike`, by removing the special
  case of named tuple in `BaseListVariable.cls_for_instance`, since the
  semantics of named tuple is different enough.
- makes sure all constructions of `NamedTupleVariable` have items with
  proper sources.

Fixes #142399.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142437
Approved by: https://github.com/jansel
2024-12-10 19:23:48 +00:00
a45326b649 [dynamo] Support multiple inheritance for custom dict construction (#142416)
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.

Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.

Fixes #141118.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
2024-12-10 19:22:15 +00:00
67cf126cf8 Disable PIP version check in collect_env (#142308)
Disables version check which might require users to reach out to PyPI, reference: https://pip.pypa.io/en/latest/cli/pip/#cmdoption-disable-pip-version-check

Switches pip to be used directly as a python module (`python3 -mpip`) instead of relying on `pip3` or `pip`
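
A hedged sketch of the invocation style this moves to (the exact arguments collect_env passes may differ):

```python
import subprocess
import sys

# Run pip as a module of the current interpreter and skip the PyPI version check.
cmd = [sys.executable, "-mpip", "list", "--format=freeze", "--disable-pip-version-check"]
packages = subprocess.run(cmd, capture_output=True, text=True).stdout
```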
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142308
Approved by: https://github.com/seemethere
2024-12-10 19:16:36 +00:00
3e28da1e06 Revert "skip test dynamo for aot_dispatch tests on ci (#142185)"
This reverts commit 7eda06b36674afa117b28ad807c3421c94e775c1.

Reverted https://github.com/pytorch/pytorch/pull/142185 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a landrace in trunk ([comment](https://github.com/pytorch/pytorch/pull/142185#issuecomment-2532605728))
2024-12-10 18:50:17 +00:00
9aefc59649 Revert "[hop][dynamo] support torch.SymInt inputs (#141524)"
This reverts commit 6713b457aee3e36ab2499fb31b733ecd7104c764.

Reverted https://github.com/pytorch/pytorch/pull/141524 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a landrace in trunk ([comment](https://github.com/pytorch/pytorch/pull/142185#issuecomment-2532605728))
2024-12-10 18:50:17 +00:00
d102cfa2cb [Profiler] Add CUDA Overhead to Auto-trace (#142271)
Summary: We already have CUDA OVERHEAD events enabled in on-demand so we should also add them to auto-trace

Test Plan: Tested using internal performance suites and found no noticeable performance change

Differential Revision: D66904879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142271
Approved by: https://github.com/ngimel
2024-12-10 18:39:59 +00:00
bce07deb96 [dtensor][cp][experiment] add CP experimental API to choose rotate method (#142093)
**Summary**
This PR adds a new experimental API `set_rotate_method` for Context Parallel. This API allows users to choose the desired communication method (between all-to-all and all-gather) for shard rotation.
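
A minimal usage sketch; the import path and the accepted strings below are assumptions based on this description, not a confirmed public API:

```python
# Hypothetical sketch: choose the communication method used to rotate KV shards
# in Context Parallel. Module path and option names are assumptions.
from torch.distributed.tensor.experimental._attention import set_rotate_method

set_rotate_method("alltoall")  # or "allgather"
```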

**Test**
`pytest test/distributed/_tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142093
Approved by: https://github.com/fegin
2024-12-10 18:25:23 +00:00
eb84788fee [fr] change back vlog(2) to LOG(INFO) (#142441)
Summary:
Change log message for future execution back from VLOG(2) to LOG(INFO).
This message is useful for Flight Recorder to verify that flight recorder dumps completed successfully (or not).

Test Plan: Tested manually on a mast job and noted that the INFO message was as expected.
(meta only link: https://fburl.com/mlhub/iui2tpc9)
```
[trainer5]:I1208 10:21:00.772841  7528 ProcessGroupNCCL.cpp:1294] [PG ID 0 PG GUID 0(precheck) Rank 21] future is successfully executed for: Flight recorder dump in heartbeatMonitor
```

Differential Revision: D66996439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142441
Approved by: https://github.com/fduwjj
2024-12-10 17:43:22 +00:00
6713b457ae [hop][dynamo] support torch.SymInt inputs (#141524)
Fixes https://github.com/pytorch/pytorch/issues/141305.

```python
        class M(torch.nn.Module):
            def forward(self, x, y, z):
                a = y.shape[0]
                b = z.shape[0]

                def true_fn(x):
                    return x + a

                def false_fn(x):
                    return x + b * z

                # When exporting with non-strict: a and b are symints,
                # so torch.compile need to wrap and trace symint inputs.
                return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```

In non-strict export, when inputs are annotated with dynamic shapes, `a` and `b` in the above example are of torch.SymInt type, so `true_fn` and `false_fn` close over torch.SymInt values. The error is triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how we previously handled SymBool inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #141610, #142185
2024-12-10 17:33:57 +00:00
7eda06b366 skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of the tests in test_aotdispatch.py are not meaningful (from a user's perspective) when run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
ghstack dependencies: #141610
2024-12-10 17:33:57 +00:00
b838bdd4d4 [dynamo] remove unnecessary set_example_value for SymBool input. (#141610)
These are automatically done in create_graph_input so we can remove them. Code refactoring only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141610
Approved by: https://github.com/zou3519
2024-12-10 17:33:48 +00:00
1986b46d63 [export] Change Tuple[()] to bool in schema to sync with thrift. (#142257)
Summary:
In the thrift schema, every None value is represented as True/False, while the OSS schema represents None as (). This causes an inconsistency between the two type systems, and the simplest fix here is to change Tuple[()] to bool in the OSS schema.

This change should NOT cause a version bump, because on the deserializer side we never read the value from as_none fields, as it doesn't carry real meaning. Therefore this schema change should be considered safe.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D66888892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142257
Approved by: https://github.com/yiming0416, https://github.com/hl475
2024-12-10 17:13:35 +00:00
b1b0afb8e8 [BE] Add type annotation to eliminate_dead_code (#142251)
Test Plan: CI

Reviewed By: evanleed

Differential Revision: D66887283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142251
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-10 17:09:21 +00:00
09b2232fd1 Make core_aten_decomp to be alias to export table (#140086)
Differential Revision: [D64554098](https://our.internmc.facebook.com/intern/diff/D64554098/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140086
Approved by: https://github.com/bdhirsh
2024-12-10 17:04:59 +00:00
fa746e3eeb [Easy] factor out inductor ophandler decompositions (#142400)
Factor out inductor operator decompositions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #134532, #142350
2024-12-10 16:58:36 +00:00
1fb3d5a4e3 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
ghstack dependencies: #134532
2024-12-10 16:50:28 +00:00
59ab3825e7 Prologue Fusion (#134532)
This PR extends our ability to fuse pointwise nodes onto the outputs of triton templates (epilogue fusion) with the ability to fuse pointwise nodes into the inputs of triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead compute the indices from the k_idx of each loop iteration. This did not have any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
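
As a hedged illustration (not code from this PR), the dequant use case corresponds to a pattern like the one below, where the pointwise cast/scale on the weight can be folded into the matmul template instead of being materialized first:

```python
import torch

def dequant_mm(x: torch.Tensor, w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Pointwise prologue on a matmul input: dequantize the int8 weight.
    w = w_int8.to(torch.float32) * scale
    return x @ w

compiled = torch.compile(dequant_mm)
x = torch.randn(64, 128)
w_int8 = torch.randint(-128, 127, (128, 256), dtype=torch.int8)
scale = torch.tensor(0.05)
out = compiled(x, w_int8, scale)
```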

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-10 16:25:57 +00:00
a751558467 [logging] Fix bug involving missing compilation_metrics fields in tlparse logs (#142423)
Summary: The line of code that's compiling the set of compilation_metrics to include in the corresponding tlparse log is missing the "legacy" and "common" fields populated above. Fix is to make sure we consider all fields in the compilation_metrics object.

Test Plan:
Before: https://fburl.com/d6em8csg (e.g, https://fburl.com/c19s7ny0)
After: https://fburl.com/5zr6kbvf (e.g, https://fburl.com/3hp14ht2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142423
Approved by: https://github.com/ezyang
2024-12-10 15:58:43 +00:00
882b6af219 c10::string_view -> std::string_view in autograd (#142354)
Differential Revision: D66939966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142354
Approved by: https://github.com/Skylion007
2024-12-10 15:43:41 +00:00
7e41717a26 c10::string_view -> std::string_view in caffe2/jit (#142383)
Test Plan: Sandcastle

Differential Revision: D66939979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142383
Approved by: https://github.com/malfet
2024-12-10 15:42:28 +00:00
dd2d0c6b80 [FSDP2] Gate PT2 code for torch deploy (#142456)
See diff for internal details

Differential Revision: [D67003832](https://our.internmc.facebook.com/intern/diff/D67003832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142456
Approved by: https://github.com/yf225, https://github.com/weifengpy, https://github.com/fegin
2024-12-10 14:39:07 +00:00
ff059587c6 support condition branch in ao debug handler (#141516)
This diff introduces support for condition statements in AO debug handler generation.

Most of the code is borrowed from ExecuTorch to avoid a circular dependency issue.

Differential Revision: [D66270691](https://our.internmc.facebook.com/intern/diff/D66270691/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141516
Approved by: https://github.com/jerryzh168
2024-12-10 14:05:12 +00:00
75530885ba Revert "[BE] Add type annotation to eliminate_dead_code (#142251)"
This reverts commit 3d04de6b2f78a78bc28ce82d6e3a4af1867ec7d8.

Reverted https://github.com/pytorch/pytorch/pull/142251 on behalf of https://github.com/jeanschmidt due to checking if reverting will fix 'FAILED [5.0221s] test_dataloader.py::TestIndividualWorkerQueue::test_ind_worker_queue' on windows ([comment](https://github.com/pytorch/pytorch/pull/142251#issuecomment-2531706362))
2024-12-10 13:57:00 +00:00
a3abe1a5ae Add support for bfloat16 atomic adds in fbcode (#141857)
This adds support for bfloat16 atomic add in fbcode (OSS will have to wait until those changes are upstreamed to triton)

Originally I attempted to write inline asm, but the triton API was not flexible enough to support this use case. In the long run the right answer is to implement this properly in OSS triton.
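
For context, a hedged sketch of an op pattern that needs atomic adds when compiled (illustrative only; whether it lowers to a native bf16 atomic depends on the triton build, per the note above):

```python
import torch

def scatter_accumulate(out: torch.Tensor, idx: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    # index_put_ with accumulate=True requires atomic adds when indices collide.
    return out.index_put_((idx,), src, accumulate=True)

out = torch.zeros(16, dtype=torch.bfloat16)
idx = torch.randint(0, 16, (256,))
src = torch.randn(256, dtype=torch.bfloat16)
result = torch.compile(scatter_accumulate)(out, idx, src)
```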

relevant issues:
* https://github.com/pytorch/pytorch/issues/137425 in fbcode only
* https://github.com/pytorch/pytorch/issues/97016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141857
Approved by: https://github.com/eellison
2024-12-10 11:40:15 +00:00
d51e6fa7f6 [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.

With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.
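
A minimal sketch of the kind of usage this enables on CPU (shapes, dtype, and the score_mod below are illustrative, not taken from the PR):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions by pushing their scores to -inf.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q = torch.randn(1, 4, 128, 64)  # (batch, heads, seq_len, head_dim), fp32 on CPU
k = torch.randn(1, 4, 128, 64)
v = torch.randn(1, 4, 128, 64)

compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
```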

For UT tests, in this PR, we include partial critical tests for CPUs as the following (conduct inference tests):
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the rest of the UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, given the larger size of the changes (1500+ LOC) that would make this PR hard to review, we will submit another PR specifically for enabling and refactoring the CPU device UTs.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-10 11:11:09 +00:00
5ba61d7fb8 [Fix][Profiler UT] Skip CPU for the UT test/profiler/test_execution_trace.py::test_execution_trace_with_pt2 (#142027)
[Fix] Skip CPU device for the UT `test_execution_trace_with_pt2`
skip CPU because triton is only for GPUs. This UT is designed to test profiling the triton kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142027
Approved by: https://github.com/aaronenyeshi
2024-12-10 09:29:19 +00:00
3d04de6b2f [BE] Add type annotation to eliminate_dead_code (#142251)
Test Plan: CI

Reviewed By: evanleed

Differential Revision: D66887283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142251
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-10 09:27:29 +00:00
539286a67b Inductor annotations (#130429)
Add NVTX annotations around training phases and buffer computations

RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224

<img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429
Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD

Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>
2024-12-10 08:53:39 +00:00
24650c3caa Revert "[Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142273)"
This reverts commit e4ecb09b3513e0ee53ed87496d8bfdf5d2944042.

Reverted https://github.com/pytorch/pytorch/pull/142273 on behalf of https://github.com/huydhn due to Internal has been ninja unlanded D66906175 ([comment](https://github.com/pytorch/pytorch/pull/142273#issuecomment-2530751665))
2024-12-10 08:16:58 +00:00
d3d1a78774 [AOTInductor] Add standalone test for compilation from ExportedProgram (#142327)
Summary: Provide a standalone path to compile and run a ExportedProgram in C.

Test Plan:
(1) Generate a compiled model from ExportedProgram
```
python generate_lowered_cpu.py --input-path /tmp/$USER/ep.pt --output-path /tmp/$USER/final.pt
```
(2) Compile a standalone test runner
```
TORCH_ROOT_DIR=/data/users/$USER/pytorch sh standalone_compile.sh standalone_test.cpp standalone_test.out
```
(3) Run test for the compiled model in step (1)
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib ./standalone_test.out /tmp/$USER/final.pt
```

Differential Revision: D66872380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142327
Approved by: https://github.com/hl475
2024-12-10 06:50:09 +00:00
b9e253cb72 [inductor] update numbytes_hint for NoneLayout to allow more fusions (#141766)
We found that [this commit](6eca0aee76) caused a ~6% performance drop in ViT INT8. This was due to changes to the `numbytes_hint` for `NoneLayout`. In this PR, we reverted the changes in `numbytes_hint` to allow more fusions.

```
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = torch.nn.Linear(768, 768)
        self.layernorm = torch.nn.LayerNorm(768, eps=1e-12)
    def forward(self, context_layer, hidden_states):
        attention_output = self.dense(context_layer)
        hidden_states = attention_output + hidden_states
        layer_output = self.layernorm(hidden_states)
        return layer_output
```
The generated code before (left) and after (right) this PR is as follows:
![image](https://github.com/user-attachments/assets/0ec65ae5-103e-4e2c-bf7c-e8bed24fc179)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141766
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-10 06:45:07 +00:00
daa27fe59d [DeviceMesh] Call no_dispatch before doing tensor slicing in DeviceMesh (#142287)
Summary:
DeviceMesh's tensor operations are control-plane operations, not data-plane ones, and should not be affected by FakeTensorMode.
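
A hedged sketch of the idea, using the internal `no_dispatch` utility and a stand-in rank tensor (names below are illustrative):

```python
import torch
from torch.utils._mode_utils import no_dispatch

mesh_tensor = torch.arange(4).reshape(2, 2)  # stand-in for DeviceMesh's rank layout tensor

# Control-plane bookkeeping: temporarily disable dispatch modes (e.g. FakeTensorMode)
# so the slicing runs on real values rather than fake tensors.
with no_dispatch():
    row_ranks = mesh_tensor[0]
```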

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142287
Approved by: https://github.com/XilunWu
2024-12-10 06:33:01 +00:00
f26b75b7ac [aarch64] add CUDA 12.6 sbsa nightly binary (#142335)
related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142335
Approved by: https://github.com/atalman
2024-12-10 06:19:28 +00:00
1cb2ebd740 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
6680a83e89 [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR add XPU support for AOT Inductor, and reuse the corresponding UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
a1c6cf7e9f Revert "Add UTs for accelerator device-agnostic runtime APIs (#133572)"
This reverts commit 952514f0c8d8ff2e1719e0ca82b0d178a5c5ff45.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but it segfaults on MacOS ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2530354401))
2024-12-10 04:42:55 +00:00
adbfdbd6a0 Revert "Add device-agnostic runtime Device/Stream C++ API (#138677)"
This reverts commit f84e533a2cb89a42c021dce7d22af7d5bd5f5ac1.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but it segfaults on MacOS ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2530354401))
2024-12-10 04:42:55 +00:00
08e9ceb0a4 Make sure the benchmark build config is tested on trunk for easy bisect (#142376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142376
Approved by: https://github.com/atalman
2024-12-10 04:42:52 +00:00
4e7056d94d Fixes in-order test flakiness (#142389)
Fixes #142343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142389
Approved by: https://github.com/michael-diggin, https://github.com/divyanshk
2024-12-10 04:19:20 +00:00
e3886fb13c misc. fixes to unflatten (#142141)
Combining several fixes to unflatten for bugs revealed by random graph testing.

The fixes target two categories of bugs:
1. Some bugs show up as exponential blowups for largish systems of nn modules. These are fixed by converting lists to sets, using caching, or otherwise rewriting to reuse computation more efficiently.
2. Other bugs were due to missing intermediate modules created when attributes such as submodules and buffers are accessed through longish paths before calling the corresponding intermediate modules, or missing attributes such as buffers and constants in submodules corresponding to multiple calls.

Differential Revision: [D66659795](https://our.internmc.facebook.com/intern/diff/D66659795/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142141
Approved by: https://github.com/ydwu4
2024-12-10 03:45:13 +00:00
e4ecb09b35 [Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142273)
**Summary:**
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)

-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`

----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.

**Test Plan:**
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
And ran a float8 dynamic scaling training script to verify it e2e

-----

Differential Revision: D66906175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142273
Approved by: https://github.com/eellison
2024-12-10 02:58:04 +00:00
3291b0a013 [DataParallel] Skip for MPS device (#142448)
As `torch._C._scatter` is only defined for CUDA/ROCm (and maybe XPU?)

This is a regression introduced by https://github.com/pytorch/pytorch/pull/141098 that went unnoticed due to https://github.com/pytorch/pytorch/issues/142206

Test plan:
```
python test_autograd.py -v -k test_dataparallel_saved_tensors_hooks
```

Before this change it failed with
```
ERROR: test_dataparallel_saved_tensors_hooks (__main__.TestMultithreadAutograd.test_dataparallel_saved_tensors_hooks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/test/test_autograd.py", line 13074, in test_dataparallel_saved_tensors_hooks
    model = torch.nn.DataParallel(Model())
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/parallel/data_parallel.py", line 153, in __init__
    raise RuntimeError("no available devices were found")
RuntimeError: no available devices were found
```

After this change it passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142448
Approved by: https://github.com/kit1980
2024-12-10 02:49:23 +00:00
cyy
9a309fb4c6 Remove ConstQuantizerPtr in torchgen (#142375)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142375
Approved by: https://github.com/albanD
2024-12-10 02:37:01 +00:00
41757372c4 Set timeout value for remaining lint jobs (#142444)
Some lint jobs are using the default 30 minutes timeout, but the jobs could wait up to 90 minutes now for the Docker image to become available after https://github.com/pytorch/test-infra/pull/6013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142444
Approved by: https://github.com/wdvr
2024-12-10 02:29:44 +00:00
e83b0fa945 set CUB_VERSION to 200001 for USE_ROCM (#140861)
Summary:
Currently, CUB_VERSION is 0 for USE_ROCM.
CUB_VERSION is used to determine whether to use advanced cub APIs for some implementations.

Test Plan:
`buck2 build --flagfile fbsource//arvr/mode/win/vs2022/cpp20/cuda12_5/dev --flagfile fbsource//arvr/mode/cuda/rtx30 fbsource//arvr/libraries/eye/apollo_visualizer:unit_test_apollo_hu_module_capability`

`buck2 build --flagfile fbcode//mode/amd-gpu fbcode//aiplatform/modelstore/checkpointing/pyper:tensor_save_load_utils`

Differential Revision: D63054638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140861
Approved by: https://github.com/eqy, https://github.com/zoranzhao, https://github.com/houseroad
2024-12-10 02:28:48 +00:00
2f1191fb6a Corrected metadata variable names (#142342)
Fixes #142341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142342
Approved by: https://github.com/janeyx99
2024-12-10 02:24:31 +00:00
5d6acd5a31 Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856)
### Motivation:

As design illustrated in Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two sections are needed to enable intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed Backend register in PyTorch distributed package**. This PR is to contribute section 2 change.

### Example:
Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-10 01:58:06 +00:00
b98f40a4d5 remove vulkan sdk installation on executorch build (#142424)
pytorch-linux-jammy-py3-clang12-executorch [started to fail](https://github.com/pytorch/pytorch/actions/runs/12244909721/job/34157668780) today due to a 404 on the Vulkan SDK we use/download (1.2.198.1, 3 years old, URL: https://sdk.lunarg.com/sdk/download/1.2.198.1/linux/vulkansdk-linux-x86_64-1.2.198.1.tar.gz )

The Vulkan SDK is probably no longer needed for building Executorch, and is not used down the line for testing.

This PR tests removing the installation of the SDK

https://github.com/pytorch/executorch/pull/7258
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142424
Approved by: https://github.com/huydhn
2024-12-10 01:50:16 +00:00
a1688d8607 Fix test_indexing on MacOS (#142440)
Where int64_t is long long rather than long

This fixes test regression introduced by https://github.com/pytorch/pytorch/pull/140597 that went undetected due to https://github.com/pytorch/pytorch/issues/142206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142440
Approved by: https://github.com/kit1980
2024-12-10 01:46:28 +00:00
871b524398 Revert "temporarily turn on keep-going/continue on error for mac (#142421)"
This reverts commit 17202ea8f6fb0eebdb14b346bb2610f08800a7df.

Reverted https://github.com/pytorch/pytorch/pull/142421 on behalf of https://github.com/malfet due to We've collected enough info for now ([comment](https://github.com/pytorch/pytorch/pull/142421#issuecomment-2530010220))
2024-12-10 01:45:21 +00:00
bef103934a [DeviceMesh][ROCm] skip ProcessGroup init test on ROCm because #ranks != #devices in CI (#142386)
**Summary**
Fixes #142361

Skip the DeviceMesh test since the test suite doesn't consider the case where `# ranks != # devices`.

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142386
Approved by: https://github.com/huydhn, https://github.com/fegin
2024-12-10 01:22:21 +00:00
a1b5067297 Enable py3.13 wheels for ROCm (#142294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142294
Approved by: https://github.com/huydhn
2024-12-10 01:10:24 +00:00
20718cdebb [Fast Packing] Add packing ukernels to gemm config (#142191)
Add file to buck build

Differential Revision: [D66692673](https://our.internmc.facebook.com/intern/diff/D66692673/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66692673/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142191
Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
2024-12-10 01:06:17 +00:00
0f6bfc58a2 Introduce remote cache key prefix to break cache (#142148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142148
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-10 00:35:50 +00:00
1cb5f38328 [EZ] Skip test_zero_grid_with_backed_symbols on Mac (#142436)
As it expects to load traced module on CUDA, which is not available on Mac bd867d691b/test/inductor/test_aot_inductor.py (L1414)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142436
Approved by: https://github.com/kit1980
2024-12-10 00:25:32 +00:00
ec746f7026 Remove an unused variable from _prims_common/wrappers.py (#138480)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492
* albanD thinks this is a bug!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138480
Approved by: https://github.com/albanD
2024-12-10 00:12:53 +00:00
5743b11039 Improve messaging of ProcessGroupNCCL destructor (#142297)
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142297
Approved by: https://github.com/wconstab
ghstack dependencies: #141510, #141511
2024-12-10 00:02:33 +00:00
bd867d691b [FSDP2] Fix backward-compatible imports (#142419)
Internal only: the previous approach meant that `from torch.distributed._composable.fsdp import fully_shard` was importing `fully_shard.py`, not the function `fully_shard`. For some reason, the resolution order is different from open source.

To fix this, we match the old import as closely as possible. Namely, we import `fully_shard.py` contents from `.fully_shard`. This should force that import to take precedence.

@diff-train-skip-merge

Differential Revision: [D66990327](https://our.internmc.facebook.com/intern/diff/D66990327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142419
Approved by: https://github.com/weifengpy
2024-12-09 23:56:32 +00:00
bcddae14ec Enhance "from_node" node meta to track source recursively (#142066)
Summary:
Change the "from_node" node meta format to be able to track the provenance of nodes recursively.

The new "from_node" format is a a list node NodeSource:

```
class NodeSource:
    node_name: str
    target: str
    graph_id: int
    pass_name: str
    action: str
    from_node: List["NodeSource"]
```

This is in preparation for the inductor provenance tracking. For background, the inductor provenance tracking doc: https://docs.google.com/document/d/1dGh9myqNhywmbfP0Quzx_f04bghDFlj8cawj8MopiO8/edit?fbclid=IwZXh0bgNhZW0CMTEAAR0jUQ0Tf4ROLDED8Y_eIzrU0KVZVdRmyIQLp-avt-kGRPI_VgYVNyjH_q0_aem_HCQ_pxHDiwOkO9mQyWB2-g&tab=t.0 (internal only),

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_unflatten_multiple_graphs_state
buck run mode/dev-nosan caffe2/test:fx -- -r node_source
```

Differential Revision: D66737916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142066
Approved by: https://github.com/avikchaudhuri
2024-12-09 23:39:15 +00:00
42b222edef [AOTI] Fix an issue when fallback op does not return a value (#142339)
Summary: Refine https://github.com/pytorch/pytorch/pull/137660 to support fallback op without a return value.

Differential Revision: D66939108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142339
Approved by: https://github.com/henrylhtsang
2024-12-09 23:24:29 +00:00
17202ea8f6 temporarily turn on keep-going/continue on error for mac (#142421)
See https://github.com/pytorch/pytorch/pull/142270 for additional info.

Make all mac default shard tests run with keep going / continue on error so we can see all the test failures.

Red signal will show up later, but you can see failing tests mid run on HUD by clicking the additional test failures button

After the job is finished, searching for "consistently: " in the logs will find the failed tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142421
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-09 23:24:10 +00:00
beeffe77e4 Revert "[inductor][cpp] Add FlexAttention support for CPU inference (#141453)"
This reverts commit db379ed1ada58608d4d3c5c35777da051e4e49e5.

Reverted https://github.com/pytorch/pytorch/pull/141453 on behalf of https://github.com/malfet due to This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794 ([comment](https://github.com/pytorch/pytorch/pull/141453#issuecomment-2529710573))
2024-12-09 22:57:59 +00:00
8d24eb0c94 [Inductor] Represent size_hints as a dict (#142249)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
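
For intuition, a small sketch of the rounding relationship described above (the dict keys are illustrative prefixes, not necessarily the exact ones inductor uses):

```python
# numels keyed by prefix -> size_hints are the same values rounded up to a power of 2
numels = {"x": 768, "r": 1000}
size_hints = {prefix: 1 << (int(n) - 1).bit_length() for prefix, n in numels.items()}
# size_hints == {"x": 1024, "r": 1024}
```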

# Test plan

The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
2024-12-09 22:31:53 +00:00
4b69a68c7c [Do not revert] Re-enable Mac testing (#142270)
The bash script modification in https://github.com/pytorch/pytorch/pull/135386 results in tests on mac in default shard not running.
This PR is expected to cause test failures, but we need to start getting signal, so landing with known failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142270
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-09 22:26:26 +00:00
274223d719 Add and use borrow_arrayref_tensor_as_tensor (#142183)
Differential Revision: [D66847773](https://our.internmc.facebook.com/intern/diff/D66847773/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142183
Approved by: https://github.com/desertfire, https://github.com/hl475
ghstack dependencies: #142340, #142182
2024-12-09 22:23:21 +00:00
18d25aa7aa Rename convert_arrayref_tensor_to_tensor to copy_arrayref_tensor_to_tensor (#142182)
Be explicit about what we are doing, in preparation for adding borrow_arrayref_tensor_as_tensor.

Differential Revision: [D66847772](https://our.internmc.facebook.com/intern/diff/D66847772/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142182
Approved by: https://github.com/desertfire
ghstack dependencies: #142340
2024-12-09 22:23:21 +00:00
dc1ef9afb4 Reapply #142091 (Unbreak dynamic shape minimal arrayref interface tests) (#142340)
Simple bug got introduced somewhere.

The original PR was reverted because it broke (caused unexpected successes for) some tests in test_aot_inductor_arrayref.py that still only run internally because #123691 hasn't been fixed. I've fixed those.

Differential Revision: [D66890276](https://our.internmc.facebook.com/intern/diff/D66890276/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142340
Approved by: https://github.com/hl475
2024-12-09 22:23:21 +00:00
cca33d50b9 [PGNCCL] Use long/short wait for different non-blocking calls (#142291)
In nonblocking mode, we always check if the NCCL communicator is ready between issuing commands to it.
Today this is done by the `waitReady()` function.
Unfortunately, the `waitReady()` function is hard-wired to use `C10D_NCCL_CHECK_TIMEOUT_SLEEP`, which sleeps for an interval between two consecutive checks.
While this is nice when waiting for comm init or finalize, it degrades performance of collective calls (which would almost certainly return success immediately.)

This PR adds a `bool longInterval` argument to `waitReady` and let call site determine whether long wait is likely; if not, `waitReady` would use `sched_yield()` to more eagerly check for readiness.

Thanks @eqy for reporting the issue that small collectives have a perf impact in nonblocking mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142291
Approved by: https://github.com/eqy, https://github.com/fduwjj
2024-12-09 22:19:58 +00:00
452e1a7840 [c10d] Update backend arg documentation (#142404)
Update doc to reflect change brought by https://github.com/pytorch/pytorch/pull/142216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142404
Approved by: https://github.com/XilunWu
2024-12-09 21:53:44 +00:00
12f1989a4a [aoti package] seek 0 after loading buffer (#142204)
Differential Revision: D66855265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142204
Approved by: https://github.com/chenyang78, https://github.com/angelayi
2024-12-09 21:53:28 +00:00
4c7688ca06 Add pytest support for unittest.subTests to CI env (#142238)
Fixes #142157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142238
Approved by: https://github.com/malfet, https://github.com/huydhn
ghstack dependencies: #142243
2024-12-09 21:48:20 +00:00
5d3bc633ff [PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown (#141511)
Making CUDA or NCCL calls in object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.

This PR does take a destroy call away from the NCCLComm dtor, and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them; otherwise we are OK with letting them possibly leak resources (and get a warning).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141511
Approved by: https://github.com/eqy, https://github.com/wconstab
ghstack dependencies: #141510
2024-12-09 21:41:15 +00:00
4dbecf3ba7 Implement CPU pins functions for HPU hooks (#139495)
Link CPU pins function in HPU hooks to the host allocator in tensor_empty

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139495
Approved by: https://github.com/zou3519
2024-12-09 21:37:20 +00:00
e52a534994 [PGNCCL] Deprecate support of onCompletionHook (#142390)
The usage of `onCompletionHook` is mostly similar to what Flight Recorder does today -- for example, measuring how long a collective takes and put it into a profiler's "database".

Since FR already records and can dump info like this, we are considering deprecating the onCompletionHook support to save a side thread. (Each PG runs 3 side threads today, which is resource consuming and complicates the code)

User can file an issue if additional information needs to be recorded.
They can also file an RFC if Flight Recorder needs to accept plugins that customize the recording.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142390
Approved by: https://github.com/fduwjj, https://github.com/fegin
2024-12-09 21:11:33 +00:00
04312293a2 [Inductor] Fix wrong CSEVariable dtype for reduction. Fix #141861 (#142189)
Fix #141861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142189
Approved by: https://github.com/jansel
2024-12-09 21:07:35 +00:00
a4dedf27b9 [Inductor] Generalize newly introduced device-bias code to align the behavior of XPU unroll reduction with cuda. (#142348)
Fix #141861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142348
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-12-09 20:58:35 +00:00
8bf28b3613 [EZ] Do not checkout builder for Linux builds (#142282)
All logic should have been migrated to .ci/manywheel folder from builder repo a while back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142282
Approved by: https://github.com/atalman
ghstack dependencies: #142276, #142277, #142382
2024-12-09 20:52:13 +00:00
cb0a302dde Fix fallthrough behaviour when Meta in TLS include set (#141581)
Fixes https://github.com/pytorch/pytorch/issues/141120

Registering a fallthrough for a backend correctly alters nonFallthroughKeysPerBackend_[backend_idx]. However, the backend_idx calculation does not take into account the local dispatch key set, which is used to temporarily turn on Meta as a backend. This means that makeFallthrough does not behave exactly as if it was a normal function which redispatched rather than a "fake function" implemented with a key mask.

So e.g. impl::computeDispatchKeySet(ks, nonFallthroughKeysPerBackend_[backend_idx]); will exclude keys like Meta which may be in the TLS include set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141581
Approved by: https://github.com/bdhirsh
2024-12-09 20:32:44 +00:00
a1bd784ffd add CK BMM instances (#142002)
Summary:
Adds instances of CK DeviceBatchedGemmMultiD_Xdl_CShuffle_V3 for the aten.bmm CK backend, along with a simple heuristic that will need improving over time.
Adds support for TN, NT, TT and NN layouts.

Differential Revision: D66662554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142002
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-09 20:31:24 +00:00
5c76a2834d Revert "add torchrec collectives to enforce global ordering (#141970)"
This reverts commit ceb94d6a7d38930d662e7eb71b9c7620de8c2997.

Reverted https://github.com/pytorch/pytorch/pull/141970 on behalf of https://github.com/malfet due to Apologies for reverting this change, but it broke MacOS testing, but CI was broken at the time ([comment](https://github.com/pytorch/pytorch/pull/141970#issuecomment-2529367680))
2024-12-09 20:25:04 +00:00
960a81fdcd [EZ] Delete unsued binary_macos_test.sh (#142382)
According to https://github.com/search?type=code&q=binary_macos_test.sh+repo%3Apytorch%2Fpytorch (and grep in the repo) it's not used anywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142382
Approved by: https://github.com/atalman
ghstack dependencies: #142276, #142277
2024-12-09 19:37:56 +00:00
cyy
b4c0973b59 [2/N] Apply bugprone-unchecked-optional-access (#141091)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141091
Approved by: https://github.com/Skylion007, https://github.com/albanD

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 19:30:19 +00:00
005c5694eb Refactor "torch.mtia.memory_stats" API (#141723)
Summary:
This diff refactors the code for the "torch.mtia.memory_stats" API to maintain the same file hierarchy as its CUDA counterpart:
- All device memory APIs are now located under ".../mtia/memory.py".
- Device memory APIs can be accessed using either "torch.mtia.XYZ" or "torch.mtia.memory.XYZ".
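
A short sketch of the two equivalent spellings mentioned above (requires an MTIA build and device; shown for illustration only):

```python
import torch

# Both entry points resolve to the same underlying implementation after the refactor.
stats_a = torch.mtia.memory_stats()
stats_b = torch.mtia.memory.memory_stats()
```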

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
Ran 14 tests in 16.657s

OK
I1127 11:06:06.505201 2133030 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-bBtLGD6Y executable has been unloaded
I1127 11:06:06.506654 2133030 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
W1127 11:06:08.731138 2133030 HazptrDomain.h:148] Tagged objects remain. This may indicate a higher-level leak of object(s) that use hazptr_obj_cohort.
```

Differential Revision: D66549179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141723
Approved by: https://github.com/nautsimon
2024-12-09 19:19:19 +00:00
db379ed1ad [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.

With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.

For UT tests, in this PR, we include partial critical tests for CPUs as the following (conduct inference tests):
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the rest of the UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, given the larger size of the changes (1500+ LOC) that would make this PR hard to review, we will submit another PR specifically for enabling and refactoring the CPU device UTs.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-09 18:44:39 +00:00
a0d49dc047 Fix to make GELU on aarch64 preserve COW input tensor (#142366)
Fixes #142365

The itensor_from_tensor call was causing the COW input tensor to materialize

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142366
Approved by: https://github.com/malfet
2024-12-09 18:42:06 +00:00
0610b9730e Do not use builder repo for MacOS builds (#142277)
Added c7564f31f7/wheel/build_wheel.sh to `.ci/wheel/` folder

Commented out the call to 39532891a0/run_tests.sh, because since 2018 this script has just checked that the tests folder is there and exited, as there is no way to run all pytorch tests in a single shard; see this logic:
```bash
#!/bin/bash
set -eux -o pipefail

# Essentially runs pytorch/test/run_test.py, but keeps track of which tests to
# skip in a centralized place.
#
# TODO Except for a few tests, this entire file is a giant TODO. Why are these
# tests # failing?
# TODO deal with Windows

# This script expects to be in the pytorch root folder
if [[ ! -d 'test' || ! -f 'test/run_test.py' ]]; then
    echo "builder/test.sh expects to be run from the Pytorch root directory " \
         "but I'm actually in $(pwd)"
    exit 2
fi

# Allow master skip of all tests
if [[ -n "${SKIP_ALL_TESTS:-}" ]]; then
    exit 0
fi
```

https://github.com/pytorch/pytorch/pull/123390 is a misread attempt to interpret the above-mentioned logic, as run_tests will be skipped if `${SKIP_ALL_TESTS}` is a non-empty string
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142277
Approved by: https://github.com/huydhn, https://github.com/atalman
ghstack dependencies: #142276
2024-12-09 18:33:58 +00:00
5e8e1d725a Remove some unused type ignores (round 1) (#142325)
Over time, a large number of the existing type ignores have become irrelevant/unused/dead as a result of improvements in annotations and type checking.

Having these `# type: ignore` linger around is not ideal for two reasons:

- They are verbose/ugly syntactically.
- They could hide genuine bugs in the future, if a refactoring would actually introduce a bug but it gets hidden by the ignore.

I'm counting over 1500 unused ignores already. This is a first PR that removes some of them. Note that I haven't touched type ignores that looked "conditional" like the import challenge mentioned in https://github.com/pytorch/pytorch/pull/60006#issuecomment-2480604728. I will address these at a later point, and eventually would enable `warn_unused_ignores = True` in the mypy configuration as discussed in that comment to prevent accumulating more dead ignores going forward.

This PR should have no effect on runtime at all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142325
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2024-12-09 18:23:46 +00:00
a52d9f6f4c Fix torch.lerp RuntimeError when weight is CPU scalar while input & end are CUDA tensor (#141820)
Fixes #141811
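
A minimal repro of the now-working call (requires a CUDA build; values are illustrative):

```python
import torch

start = torch.randn(4, device="cuda")
end = torch.randn(4, device="cuda")
weight = torch.tensor(0.5)             # 0-dim CPU tensor used as the weight
out = torch.lerp(start, end, weight)   # previously raised a RuntimeError
```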

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141820
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-12-09 18:14:54 +00:00
d99c9c2acb [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` options which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slips previously/today because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-12-09 17:56:03 +00:00
219e9c83a5 Revert "[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)"
This reverts commit 854d83133bd4b0bca8ba19477c56ef2dd896dfc7.

Reverted https://github.com/pytorch/pytorch/pull/140269 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
6fcb294e18 Revert "[AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)"
This reverts commit 91d30546a4338b17f31d31a674662aa53d61b1aa.

Reverted https://github.com/pytorch/pytorch/pull/140664 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
90fc2b42e3 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 82544bd3a2f71e5995e6b035433139fad884e277.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still has failures internally when building, D66923759 ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716))
2024-12-09 17:04:20 +00:00
dd5df002b9 [pt2e][quant] Make move_exported_model_to_train/eval idempotent (#142239)
Summary: Before, we would recompile the model unnecessarily even
if the model was already in the desired mode. For training
frameworks that assume `model.train()` is idempotent and calls
this before every single training step, this led to a bunch of
tiny graphs and poor performance. This commit makes these calls
no-ops if we're already in the target train/eval mode.
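
A hedged sketch of the calling pattern this helps; the import path is an assumption and `exported_model` is a placeholder for a model produced by the pt2e export/prepare flow:

```python
# Assumed import path for the pt2e helper referenced above.
from torch.ao.quantization import move_exported_model_to_train

model = move_exported_model_to_train(exported_model)   # exported_model: placeholder
model = move_exported_model_to_train(model)            # now a no-op instead of a recompile
```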

Test Plan:
python test/test_quantization -k TestQuantizePT2E.test_allow_exported_model_train_eval_idempotent
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142239
Approved by: https://github.com/jerryzh168
2024-12-09 16:50:20 +00:00
d29e0ac9e9 Use set -o pipefail for build.sh (#142377)
This would have made https://github.com/pytorch/pytorch/pull/142359 a
hard failure.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142377
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-09 16:30:57 +00:00
9b6ef8abaf Update inductor jobs to use CUDA 12.4 (#142177)
CUDA 12.4 is the default now.  This frees up some resources.  This also fixes newly added Python 3.13 job by #140733. That PR missed adding the new Docker image `pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks` into docker build workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142177
Approved by: https://github.com/atalman
2024-12-09 16:18:38 +00:00
02848c2e14 [cpu/aarch64] fix compilation for Vec:bf16 (128bit) (#142370)
Fix typo causing compilation error on aarch64 architecture with BF16 support. (#139090)

tag: @swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142370
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-09 16:17:32 +00:00
a17ecd8668 Add missing bc CI dependency (#142359)
The [bc](https://www.geeksforgeeks.org/bc-command-linux-examples) command that I use to calculate the MAX_JOBS in https://github.com/pytorch/pytorch/pull/142164 isn't part of the Docker image https://github.com/pytorch/pytorch/actions/runs/12230618287/job/34113698986#step:14:321.  I missed this error when landing https://github.com/pytorch/pytorch/pull/142164.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142359
Approved by: https://github.com/Skylion007
2024-12-09 16:07:40 +00:00
1589c2bc4b [c10d][UCC] Add _reduce_scatter_base to c10d::ProcessGroupUCC (#138021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138021
Approved by: https://github.com/kwen2501
2024-12-09 16:02:24 +00:00
8d9ac9d94e [aarch64] Fix libcusparselt format for CUDA sbsa docker (#142363)
Corrects https://github.com/pytorch/pytorch/pull/141433/files

Error whe building arm wheel https://github.com/pytorch/pytorch/actions/runs/12226514901/job/34101913511
`/opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/local/cuda/lib64/libcusparseLt.so: error adding symbols: file in wrong format`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142363
Approved by: https://github.com/Aidyn-A, https://github.com/Skylion007, https://github.com/atalman
2024-12-09 15:29:36 +00:00
5fc9f419ef [AOTI] Fix multi-kernel codegen when using one-pass (#142333)
Summary: Update multi-kernel codegen to one-pass, following https://github.com/pytorch/pytorch/pull/141980.

Differential Revision: [D66936717](https://our.internmc.facebook.com/intern/diff/D66936717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142333
Approved by: https://github.com/chenyang78
ghstack dependencies: #141980
2024-12-09 14:49:10 +00:00
4d43ec2189 [AOTI] Switch GPU codegen to one-pass (#141980)
Summary: With autotune_at_compile_time enabled, AOTI now can perform CUDA codegen in one pass. CUDA kernel related code is generated in a deferred way, after autotuning is done. This one-pass implementation will eliminate any issue caused by disparity between passes in the previous two-pass implementation (which caused multiple bug reports in the past). One-pass implementation also avoids cloning mutated inputs needed in the two-pass implementation, which will reduce GPU memory consumption.

Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
2024-12-09 14:40:34 +00:00
f14ce3a923 Update slow tests (#140248)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140248
Approved by: https://github.com/pytorchbot
2024-12-09 11:15:51 +00:00
7101dcfb98 Revert "[inductor][cpp] Add FlexAttention support for CPU inference (#141453)"
This reverts commit 7edbde3334df3223c009769d8226d06071e1fff9.

Reverted https://github.com/pytorch/pytorch/pull/141453 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing periodic NO_AVX2 ([comment](https://github.com/pytorch/pytorch/pull/141453#issuecomment-2527377475))
2024-12-09 09:26:20 +00:00
a108b282ff [4/N] Avoid copy in std::get (#142285)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142285
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 07:59:35 +00:00
2cc01cc6d3 [Quant][Inductor][X86] add fusion pass for linear_dynamic_fp16 with relu (#141556)
**Description**
Fuse and prepack weight for `linear_dynamic_fp16` with post op relu. In Inductor, the pattern we see is
```
fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
  |
(reshape) <- relu
```
Or
```
fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
  |
(add) <- relu
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.

Fuse the pattern with weight prepack, and we get
```
fp32 activation
  |
onednn.linear_relu_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.
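
For reference, a hedged eager-mode sketch of the computation this fused pattern represents (illustrative only; the actual fusion targets the onednn ops named above):

```python
import torch
import torch.nn.functional as F

def linear_relu_dynamic_fp16_ref(x: torch.Tensor, w: torch.Tensor, b=None):
    # weight is stored as fp16 (to_fp16) and cast back to fp32 (to_fp32)
    # right before the matmul, then relu is applied as the post op
    w_fp32 = w.to(torch.float16).to(torch.float32)
    return F.relu(F.linear(x, w_fp32, b))
```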

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_relu_dynamic_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141556
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #141549
2024-12-09 05:05:11 +00:00
7435f57f60 [BE] Remove unusued channels arg in col2im (#142336)
Number of channels is passed to col2im kernel/device function, but is not used during the computations at all
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142336
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-12-09 01:49:41 +00:00
75e72e1408 Adding lowering to persistent-tma device kernel for _scaled_mm (#142045)
# Summary
This PR adds an alternative triton lowering for _scaled_mm. This uses an updated mm template that utilizes persistent scheduling + TMAs on A and B matrices.

Limitations:
* This implementation does not work with bias values: 0602676c8d/torch/_inductor/kernel/mm_scaled.py (L106). The plan is to remove this workaround and enforce that both scaling and bias are properly done as epilogues onto the existing templates
* K dim must be 32 or greater for these to take effect
* Gated by a config flag (currently defaults to off; maybe it should be on)

## Testing
We don't have any tests exercising this code in CI/CD, but I updated the relevant tests in test_fp8 and they are all green:
<img width="1680" alt="Screenshot 2024-12-05 at 7 24 07 PM" src="https://github.com/user-attachments/assets/9c520541-d97a-416f-9af7-e68b366ec90f">

## Follow Ups
* Work to update the base mm triton templates and utilize the same template from mm/addmm/scaled_mm w/ respective epilogues
* Tuning on Persistent kernel configs. I found ones that work for my problem shapes but need to do some more NCU work

### Some profiling code I was using

Code I am using to iterate with:
```Python
import torch
from dataclasses import dataclass
from jsonargparse import CLI
import logging
from pathlib import Path

from transformer_nuggets.utils.benchmark import ProfileConfig, profile_function
from torchao.float8.inference import (
    addmm_float8_unwrapped_inference,
    preprocess_data,
    Float8MMConfig,
)
from transformer_nuggets.fp8.fp8_matmul import (
    matmul_persistent,
    matmul_tma_persistent,
    matmul_device_tma_persistent,
)
from enum import Enum

logging.getLogger("transformer_nuggets").setLevel(logging.INFO)

class FP8Kernel(Enum):
    PERSISTENT = "Persistent"
    PERSISTENT_TMA = "Persistent-TMA"
    DEVICE_TMA = "Device-TMA"
    SCALED_MM = "Scaled-MM"

class ScalingStrategy(Enum):
    PER_TENSOR = "PerTensor"
    PER_ROW = "PerRow"

@dataclass(frozen=True)
class ExperimentConfig:
    M: int
    K: int
    N: int
    scaling_strategy: ScalingStrategy
    fp8_kernel: FP8Kernel
    compile: bool

def get_fp8_matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    scaling_strategy: ScalingStrategy,
    fp8_kernel: FP8Kernel,
):
    A_fp8 = A.to(torch.float8_e4m3fn)
    B_fp8 = B.to(torch.float8_e4m3fn)
    A_fp8, B_fp8 = preprocess_data(A_fp8, B_fp8, Float8MMConfig(use_fast_accum=True))

    if scaling_strategy == ScalingStrategy.PER_TENSOR:
        a_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
        b_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
    elif scaling_strategy == ScalingStrategy.PER_ROW:
        a_scale = torch.ones((A_fp8.size(0), 1), device="cuda", dtype=torch.float32)
        b_scale = torch.ones((B_fp8.size(1), 1), device="cuda", dtype=torch.float32).T
    else:
        raise ValueError(f"Invalid scaling strategy: {scaling_strategy}")

    assert fp8_kernel == FP8Kernel.SCALED_MM
    return lambda: addmm_float8_unwrapped_inference(
        A_fp8, a_scale, B_fp8, b_scale, output_dtype=torch.bfloat16, use_fast_accum=True
    )

def run_matmul(config: ExperimentConfig):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = torch.randn(config.M, config.K, device=device, dtype=torch.bfloat16)
    B = torch.randn(config.K, config.N, device=device, dtype=torch.bfloat16)

    fp8_matmul = get_fp8_matmul(A, B, config.scaling_strategy, config.fp8_kernel)

    if config.compile and config.fp8_kernel == FP8Kernel.SCALED_MM:
        fp8_matmul = torch.compile(fp8_matmul, mode="max-autotune-no-cudagraphs")

    _ = fp8_matmul()

    return

def main():
    torch.random.manual_seed(123)

    # Define your experiment configuration here
    config = ExperimentConfig(
        M=8192,
        K=8192,
        N=8192,
        scaling_strategy=ScalingStrategy.PER_TENSOR,
        fp8_kernel=FP8Kernel.SCALED_MM,
        compile=True,
    )

    run_matmul(config)

if __name__ == "__main__":
    CLI(main)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142045
Approved by: https://github.com/eellison
2024-12-09 01:48:40 +00:00
29e985b7b0 [dim_order] raised runtime error when tensor has ambiguous dim order (#141632)
This diff makes tensor.dim_order() raise an error when a tensor's dim order is ambiguous. Detailed discussion can be found at https://fb.workplace.com/groups/894363187646754/permalink/2039987243084337/

Differential Revision: [D65133579](https://our.internmc.facebook.com/intern/diff/D65133579/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141632
Approved by: https://github.com/larryliu0820
2024-12-08 23:16:57 +00:00
e1196dfe51 Deprecate torch._utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------
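
As a migration note, the commonly recommended public replacement is `torch.compiler.is_compiling()` (an assumption based on the deprecation; this message does not name the replacement). A minimal sketch:

```python
import torch

def debug_log(msg: str) -> None:
    # previously guarded with torch._utils.is_compiling()
    if not torch.compiler.is_compiling():
        print(msg)  # skip Python side effects while tracing/compiling
```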

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-08 22:55:36 +00:00
869665c44c [torchgen] Fix an unused variable in api/python.py (#142337)
Extracted from https://github.com/pytorch/pytorch/pull/136359

Changes behavior, but the original code seems like it was an obvious oops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142337
Approved by: https://github.com/Skylion007
2024-12-08 21:48:08 +00:00
ef26f1c57e Migrate windows build scripts from builder to pytorch (#142156)
Move builder windows build scripts to pytorch/pytorch
Remove builder checkout during windows build
Pending removal of windows build scripts https://github.com/pytorch/builder/tree/main/windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142156
Approved by: https://github.com/malfet, https://github.com/chuanqi129
2024-12-08 21:43:59 +00:00
05c1f37188 Enable memory swap on Linux docker build (#142293)
Lots of CUDA build jobs are OOM-ing in trunk and the hotspot seems to come from building flash attention, for example https://github.com/pytorch/pytorch/actions/runs/12208390090/job/34061532155#step:14:9369.  There are several options around:

* Mimic the logic from https://github.com/Dao-AILab/flash-attention/blob/main/setup.py#L495-L508.  We are using `linux.2xlarge` for the build with 8 CPU and 16GB.  The current max number of parallel jobs is `(8 - 2)/3 = 2` while the logic from upstream repo has `16 / 9 = 1.7`, so it's very close.
* Upgrade to `linux.2xlarge.memory` with 8 CPU and 64GB for all CUDA build, it could afford up to 7 max parallel jobs according to the above logic.
* Enable swap.

These approaches can work together, so I want to experiment with swapping first, as this technique, if it works, could be useful in other contexts too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142293
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-08 20:59:43 +00:00
46dc2965de Adding missing space to pybind_utils.h error message (#142258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142258
Approved by: https://github.com/Skylion007
2024-12-08 20:46:32 +00:00
0c66cee9a2 [Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.

This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
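
A rough sketch (illustrative only, not the actual Inductor code) of what such a decorator could look like for string-based Triton codegen:

```python
import functools

def maybe_upcast_float32(convert_output: bool = True):
    """Wrap a codegen macro so fp16/bf16 operands are computed in fp32."""
    def decorator(op):
        @functools.wraps(op)
        def wrapper(*args, dtype="tl.float32"):
            low_precision = dtype in ("tl.float16", "tl.bfloat16")
            if low_precision:
                # upcast each operand expression to float32
                args = tuple(f"{a}.to(tl.float32)" for a in args)
            out = op(*args)
            if low_precision and convert_output:
                # downcast the result back to the original dtype
                out = f"({out}).to({dtype})"
            return out
        return wrapper
    return decorator

@maybe_upcast_float32()
def sqrt(x: str) -> str:
    return f"libdevice.sqrt({x})"
```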

# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.

Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.

This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-12-08 19:42:48 +00:00
c814dd08aa Fixed installing dependencies instructions in CONTRIBUTING.md (#142334)
In the original code, “pip install -r requirements” was missing the suffix “.txt”, so I added it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142334
Approved by: https://github.com/malfet
2024-12-08 19:35:36 +00:00
e343f46464 [inductor] Refactor is_big_gpu (#142220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142220
Approved by: https://github.com/yanboliang
ghstack dependencies: #142219, #142033, #142222
2024-12-08 18:51:36 +00:00
dc7461d6f5 docstring_linter finds long classes and functions without docstrings (#140426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140426
Approved by: https://github.com/eellison
2024-12-08 17:03:57 +00:00
d0b9874603 Teach ruff_linter to report syntax errors (fix #140228) (#142312)
Now syntax errors look like this:

```
torch/_dynamo/variables/base.py:20:

  Error (RUFF) E999
    SyntaxError: Expected ',', found indent.
    See https://beta.ruff.rs/docs/rules/
    >>>  19  |class SourceType(Enum]:
         20  |    """
         21  |    This Enum divides VariableTracker into 2 cases, depending on the variable
         22  |    it represents:

[...more errors...]
```

Note that most syntax errors lead to a cascade of other errors, so the exception is generally wrong, but the location and name are good.

Before they looked like this:

```
>>> General linter failure:

  Error (RUFF) Linter failed
    Linter failed. This a bug, please file an issue against the linter
    maintainer.

    CONTEXT:
    Linter command failed with non-zero exit code.
    STDERR:
    <MainThread:DEBUG> $ /home/rec/.conda/envs/pytorch-dev-constant/
    bin/python3 -m ruff check --exit-zero --quiet --output-format=json
    --config=pyproject.toml /home/rec/git-constant/pytorch/torch/_dynamo/
    variables/base.py
    <MainThread:DEBUG> took 38ms
    Traceback (most recent call last):
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 465, in <module>
        main()
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 424, in main
        lint_messages = check_files(
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 273, in check_files
        return [
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 288, in <listcomp>
        severity=severities.get(vuln["code"],
    get_issue_severity(vuln["code"])),
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 172, in get_issue_severity
        if any(
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 173, in <genexpr>
        code.startswith(x)
    AttributeError: 'NoneType' object has no attribute 'startswith'

    STDOUT:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142312
Approved by: https://github.com/Skylion007
2024-12-08 16:48:05 +00:00
2c6d094869 [AOTI] Assert misaligned input (#142136)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141891. JIT Inductor relies on copy_misaligned_inputs to fix misaligned inputs. For AOTInductor's use scenario, this is an unacceptable performance hit, so we codegen an input alignment check at the entry point and throw an error if any misalignment exists.
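
An illustrative Python equivalent of the check the generated entry point performs (the real codegen emits C++, and the alignment value here is an assumption, not taken from the AOTI sources):

```python
ALIGNMENT = 16  # assumed byte alignment expected by the generated kernels

def assert_inputs_aligned(args):
    for i, t in enumerate(args):
        if t.data_ptr() % ALIGNMENT != 0:
            raise RuntimeError(
                f"Input {i} is misaligned; AOTInductor expects aligned inputs "
                "rather than copying them the way JIT Inductor does."
            )
```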

Differential Revision: [D66881038](https://our.internmc.facebook.com/intern/diff/D66881038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142136
Approved by: https://github.com/eellison, https://github.com/ezyang
ghstack dependencies: #142133
2024-12-08 15:13:01 +00:00
5035ff0796 [AOTI] Refactor codegen_inputs signature (#142133)
Summary: Since codegen_inputs only writes to self.prefix, drop IndentedBuffer from its parameters, to make the API consistent with other similar functions.

Differential Revision: [D66881040](https://our.internmc.facebook.com/intern/diff/D66881040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142133
Approved by: https://github.com/chenyang78
2024-12-08 15:05:03 +00:00
32b94644fc Add manylinux_2_28_x86_64 tags to wheel builds (#141988)
Tag the wheels with the appropriate manylinux 2.28 tags.
Initially used auditwheel, but it does much more than just adding tags: it also tries to package multiple libs into the wheel, which we don't want at this point. Hence we just changed the tag and filename. If no libs are repackaged by auditwheel, all it does is tag and rename.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141988
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-08 14:36:24 +00:00
0ecba57561 [CI] Add xpu new docker image name into docker builds workflow (#142298)
Add the missing new XPU docker image name to adapt to the new mechanism introduced by https://github.com/pytorch/test-infra/pull/6013
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142298
Approved by: https://github.com/huydhn
2024-12-08 09:34:20 +00:00
7edbde3334 [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings FlexAttention inference support to the Inductor backend in torch.compile (supported precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, while bringing optimized performance on CPUs.

With this, users can transparently extend their FlexAttention usage to CPUs with good and common support from torch.compile, in both functionality and performance.

For UTs, this PR includes a subset of the critical tests for CPUs, as follows (inference tests only):
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, due to the larger amount of changes (1500+ LOC) that would make this PR hard to review, we will submit another PR specifically for enabling and refactoring the CPU device UTs.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-08 07:57:21 +00:00
0bd7b7ae58 Add version check for C++ pytree availability (#142299)
Resolves #142256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142299
Approved by: https://github.com/jansel, https://github.com/weifengpy
2024-12-08 06:27:32 +00:00
2682e5e0d4 [BE]: Add TypeGuard to is_symbolic (#142304)
Improves type inference for is_symbolic. If it returns True, the value must currently be either a SymInt or a torch.Tensor.
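
A minimal sketch of what a TypeGuard-annotated is_symbolic can look like (the real helper lives elsewhere in the codebase and may check more types):

```python
from typing import TypeGuard, Union  # TypeGuard needs Python 3.10+

import torch

def is_symbolic(x: object) -> TypeGuard[Union[torch.SymInt, torch.Tensor]]:
    # a True result lets type checkers narrow x to SymInt | Tensor
    return isinstance(x, (torch.SymInt, torch.Tensor))
```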

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142304
Approved by: https://github.com/jansel
2024-12-08 02:18:17 +00:00
2fc8bac091 [ROCm] Fix unit test: matmul_offline_mgpu_gpu_tunableop (#142269)
Fixes #141652

This PR fixes (at least in part) the unit test failure. However, we may also need to do a separate flush of the untuned results; if this test continues to be flaky, another PR will be needed to flush the untuned results as well.

Tested locally and it seems to be working.

Also fixing code that was accidentally commented out in the unit test from the prior multi-GPU offline tuning PR https://github.com/pytorch/pytorch/pull/139673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142269
Approved by: https://github.com/jeffdaily
2024-12-08 02:18:00 +00:00
b1bb860d3c c10::string_view -> std::string_view in aten (#141903)
D66560348 passes internally, but won't export, so I'm rebuilding here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141903
Approved by: https://github.com/Skylion007
2024-12-07 23:23:52 +00:00
8cb68b136f Proper modeling of recursive types (#142300)
Currently there are a few type annotations that falsely state that mypy doesn't support recursive types.

Recursive type support has been available in mypy for a few years already. It was officially enabled in [version 0.991](https://mypy-lang.blogspot.com/2022/11/mypy-0990-released.html). Pyright had support for recursive types even earlier (https://github.com/microsoft/pyright/issues/569), so there is probably no reason not to model these types correctly.

This PR models these types properly now. Since this has turned a few implicit `Any` into fully typed variables that are not narrowed cleanly, a small number of type ignores were necessary.

Note that regarding `Argument`, it is desirable to model it in a covariant way (i.e. using `Sequence` and `Mapping`) instead of making it invariant unnecessarily (using `List` and `Dict`). If it were modeled invariantly, it would for instance mean that a `List[Node]` would not type check as `Argument`, because invariance would mean that it really has to be a `List[Argument]` (i.e., including all the branches of the union type). Since even the name of the type "argument" strongly suggests that it is semantically used as an argument, covariance is natural anyway.

There are no changes in this PR that affect runtime behavior.
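
An illustrative recursive, covariant alias in the spirit described above (names are hypothetical, not the exact torch.fx definitions):

```python
from typing import Mapping, Optional, Sequence, Union

class Node: ...

Argument = Optional[
    Union[
        Node,
        str,
        int,
        float,
        bool,
        Sequence["Argument"],       # covariant: a Sequence[Node] type-checks
        Mapping[str, "Argument"],
    ]
]
```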

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142300
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-07 21:30:45 +00:00
17f1a42c13 Add missing py::bytes to pybind_utils tryToInferType (#142265)
I'm not sure what the best way to fix this is, but this does unbreak an internal test.

Test Plan: Sandcastle

Reviewed By: itamaro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142265
Approved by: https://github.com/houseroad
2024-12-07 20:31:57 +00:00
3b531f18c7 [BE] Improve Flight Recorder efficacy (#142178)
Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.

There are two changes in here.
1. We attempt to write out the FR buffer with stack traces. If this fails, we attempt to capture the FR buffer again, but this time without stack traces (see the sketch after this list). The assumption here is that FR could be locking up when unwinding the stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2.  Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.
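
A sketch of the retry described in item 1; `collect_fr_buffer` stands in for the real capture routine and is an illustrative name, not the actual API:

```python
def dump_flight_recorder(writer, collect_fr_buffer):
    """Write the FR buffer, retrying without stack traces if the first pass fails."""
    try:
        writer.write(collect_fr_buffer(include_stacktraces=True))
    except Exception:
        # stack unwinding may be what locks up; retry without stack traces,
        # reusing the same file name to keep things simple
        writer.write(collect_fr_buffer(include_stacktraces=False))
```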

TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer, so I'll add this in a follow-on diff.

Test Plan: Unit tests.

Differential Revision: D66843194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142178
Approved by: https://github.com/kwen2501
2024-12-07 19:32:28 +00:00
91d30546a4 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269
2024-12-07 19:22:04 +00:00
854d83133b [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR add XPU support for AOT Inductor, and reuse the corresponding UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268
2024-12-07 19:22:04 +00:00
3d227ae315 [Intel GPU] Support getStreamFromExternel for XPU. (#140268)
In the AOT Inductor scenario, the GPU stream can be created outside of the pool of `XPUStream`, and we need to create an `XPUStream` which refers to this stream for the common logic of AOTI; for example, a stream guard is a guard for `XPUStream`. So we add getStreamFromExternel following the design of CUDAStream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140268
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-07 19:22:04 +00:00
843018f407 [inductor] Refactor split factor into V.choices.reduction_split_factor (#142222)
I want to reuse this for cooperative reduction heuristics (in a later PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142222
Approved by: https://github.com/eellison
ghstack dependencies: #142219, #142033
2024-12-07 17:48:45 +00:00
81edca08ab [inductor] Refactor some DeviceProperties usage (#142033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142033
Approved by: https://github.com/eellison
ghstack dependencies: #142219
2024-12-07 17:48:45 +00:00
0367a31401 [inductor] Minor typing changes (#142219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
2024-12-07 17:48:37 +00:00
524395edf4 [aarch64] build cuda 12.6 manywheel dockers (#139988)
Add Builds sbsa 12.6 manywheel dockers to workflow
Related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139988
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-12-07 15:38:41 +00:00
82544bd3a2 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-07 15:23:38 +00:00
f84e533a2c Add device-agnostic runtime Device/Stream C++ API (#138677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #133572
2024-12-07 13:14:10 +00:00
952514f0c8 Add UTs for accelerator device-agnostic runtime APIs (#133572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-07 13:14:10 +00:00
b6a64b64de Add ncu profile to final output_code.py (#142259)
This PR adds `--ncu` to the output code benchmark utils to generate ncu profile reports.

Test Plan:
```
% python torch_compile_debug/run_2024_12_05_22_27_59_182730-pid_4112931/torchinductor/model__0_forward_1.0/output_code2.py --ncu
0.000160
Peak GPU memory usage 671.220 MB
==PROF== Connected to process 502514 (python3.10)
==PROF== Connected to process 503187 (python3.10)
==WARNING== Unable to access the following 6 metrics: ctc__rx_bytes_data_user.sum, ctc__rx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__rx_bytes_data_user.sum.per_second, ctc__tx_bytes_data_user.sum, ctc__tx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__tx_bytes_data_user.sum.per_second.

==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 38 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 1: 0%....50%....100% - 38 passes
==PROF== Profiling "triton_poi_fused_embedding_0" - 2: 0%....50%....100% - 38 passes
6.891588
==PROF== Disconnected from process 502514
==PROF== Disconnected from process 503187
==PROF== Report: /tmp/ncu_output_20241206_131245.ncu-rep

NCU profiling results for benchmark None:
NCU report has been written to /tmp/ncu_output_20241206_131245.ncu-rep
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142259
Approved by: https://github.com/eellison
2024-12-07 07:54:43 +00:00
fc831f76f8 Revert "[Inductor] Represent size_hints as a dict (#142249)"
This reverts commit f870ee2cc4f3dd1babd3043b5291d54f487a2999.

Reverted https://github.com/pytorch/pytorch/pull/142249 on behalf of https://github.com/blaine-rister due to would break internal tests ([comment](https://github.com/pytorch/pytorch/pull/142249#issuecomment-2524991008))
2024-12-07 07:43:51 +00:00
f870ee2cc4 [Inductor] Represent size_hints as a dict (#142249)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
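
A tiny illustration of the representation change (values are made up):

```python
def next_power_of_2(n: int) -> int:
    return 1 << max(n - 1, 0).bit_length()

numels = {"x": 1000, "r": 768}                      # dims keyed by prefix
size_hints = {prefix: next_power_of_2(n) for prefix, n in numels.items()}
assert size_hints == {"x": 1024, "r": 1024}
# heuristics can now look up a dimension by prefix, e.g. size_hints.get("r")
```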

# Test plan

The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
2024-12-07 06:43:05 +00:00
a58d2f14e8 [DTensor] Add a private util for sharding tensor (#142288)
Locally shards a full tensor based on the indicated sharding arrangement, and returns a DTensor containing the local shard.

warning: This is a private API intended to skip the communication otherwise required by `distribute_tensor`. It is only applicable to the case where all ranks have the same `full_tensor`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142288
Approved by: https://github.com/wz337
2024-12-07 05:30:18 +00:00
2d9b081012 Move knapsack algorithms into separate file (#141451) (#141614)
Summary:

Original Context Doc: https://docs.google.com/document/d/1Gv5ZqN3UY7kTuCUd0JLU3L_onwVIr2vd3AiZyim1Muc/edit?disco=AAABZX_tqdk

### Changes

This diff restructures the Partitioners.py file in the _functorch package.

* Moves the three knapsack problem algorithms (greedy, ilp, dp) into a separate file

Test Plan:
### Unit Testing

```
$ buck test mode/opt //caffe2/test/functorch:test_ac
File changed: fbsource//xplat/caffe2/test/functorch/TARGETS
File changed: fbsource//xplat/caffe2/test/functorch
File changed: fbsource//xplat/caffe2/test
7 additional file change events
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/a2f91f8a-5326-435e-9075-5af0de930b8b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7036874660108924
Network: Up: 28MiB  Down: 3.5GiB  (reSessionID-19af3d71-7528-448c-9126-5615d27b3bd7)
Jobs completed: 423656. Time elapsed: 3:57.2s.
Cache hits: 99%. Commands: 146147 (cached: 145758, remote: 317, local: 72)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0

```

### Tested on Local Training Run

```
CUDA_VISIBLE_DEVICES=5,6 AOT_PARTITIONER_DEBUG=1 PARTITIONER_MEMORY_BUDGET_PARETO=0 buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2 2>&1 | tee log_2024-11-2320:41:50.757256__bento_trigger.txt
```

Output Summary Paste: P1685697066

Differential Revision: D65800097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141614
Approved by: https://github.com/jansel, https://github.com/Chillee
2024-12-07 03:21:52 +00:00
c863227be3 [Quant][Inductor][X86] add fusion pass for linear_dynamic_fp16 (#141549)
**Description**
For `linear_dynamic_fp16`, we insert `quantize` and `dequantize` between x/w and linear to have the following pattern:
```
  x
  |
linear <- to_fp32 <- to_fp16 <- w
```
In Inductor, the pattern we finally see will be
```
fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
  |
(reshape)
```
Or
```
fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
  |
(add)
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.

Fuse the pattern with weight prepack, and we get
```
fp32 activation
  |
onednn.linear_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_dynamic_fp16
```

Differential Revision: [D66802159](https://our.internmc.facebook.com/intern/diff/D66802159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141549
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-12-07 03:08:08 +00:00
7939b5f5f9 remove sccache from bazel, to go together with #140614 (#142241)
removes sccache from bazel builds. Will move bazel builds to periodic if builds succeed

CUDA bazel test succeeded, moving to periodic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142241
Approved by: https://github.com/malfet
2024-12-07 02:08:06 +00:00
40d1b5f490 Revert "Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)"
This reverts commit add4a42ea2c56f7687a3564aefe9e017cd118936.

Reverted https://github.com/pytorch/pytorch/pull/140320 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_hip_device_count is failing in trunk after this land ([comment](https://github.com/pytorch/pytorch/pull/140320#issuecomment-2524742845))
2024-12-07 01:28:51 +00:00
db313c87f9 [OSS] Enable Flight Recorder buffer for all (#142260)
Summary: Enable collecting Flight Recorder data for all.

Test Plan: This has been rolled out internally for a while now.

Differential Revision: D66897635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142260
Approved by: https://github.com/kwen2501, https://github.com/fduwjj, https://github.com/wconstab
2024-12-07 01:28:12 +00:00
78425bff30 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`

Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
868d62552d [aoti] Add load_constants to package api (#142246)
Summary:
With the changes in https://github.com/pytorch/pytorch/pull/140755 and https://github.com/pytorch/pytorch/pull/141997, I added a load_constants function to the packaging API. Currently this doesn't work for cpu.

The workflow is something like:

```
ep = torch.export.export(model, example_inputs)
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs=inductor_configs)
compiled = torch._inductor.aoti_load_package(package)

print(compiled.get_constant_fqns())  # see what are the fqns needed/available

compiled.load_constants(new_state_dict, check_full_update=True)  # update the constants in AOTI
```

You can also use the `aot_inductor.package_constants_in_so` config to stop including the constants in the so:
```
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs={`aot_inductor.package_constants_in_so`: False)
compiled = torch._inductor.aoti_load_package(package)
compiled(*inputs)  # segfaults because there are no constants --> we should probably have a better error msg

compiled.load_constants(new_state_dict, check_full_update=True)
compiled(*inputs)
```

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_so_without_weight"  `

Differential Revision: D66796206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142246
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2024-12-07 01:18:42 +00:00
3a3638be50 [BE] Enable Scalar.h compilation on 32-bit system (#142235)
By hiding ambiguous Scalar(long long) constructor behind `std::enable_if_t<sizeof(void *) == 8>`

Followup after https://github.com/pytorch/pytorch/pull/141244

Test Plan: Run `printf "#include <c10/core/Scalar.h>\n c10::Scalar x(3);" | gcc -x c++ -std=c++17 -I. -Ibuild - -c` on ARMv7 system.
Before this change it failed with:
```
In file included from <stdin>:1:
./c10/core/Scalar.h:83:3: error: ‘c10::Scalar::Scalar(long long int)’ cannot be overloaded with ‘c10::Scalar::Scalar(int64_t)’
   83 |   Scalar(long long vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/Scalar.h:50:3: note: previous declaration ‘c10::Scalar::Scalar(int64_t)’
   50 |   Scalar(type vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/ScalarType.h:288:3: note: in expansion of macro ‘DEFINE_IMPLICIT_CTOR’
  288 |   _(int64_t, Long)                                \
      |   ^
./c10/core/Scalar.h:52:3: note: in expansion of macro ‘AT_FORALL_SCALAR_TYPES_AND7’
   52 |   AT_FORALL_SCALAR_TYPES_AND7(
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142235
Approved by: https://github.com/Skylion007
2024-12-07 01:05:55 +00:00
022cbf2f31 Back out "[Reland][Environment Variable][5/N] Use thread-safe getenv functions (#140594)" (#142226)
Summary: Failed to write the auto-tune result to `PYTORCH_TUNABLEOP_FILENAME` with this change (empty file), reverting to unblock.

Test Plan: CI

Differential Revision: D66870750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142226
Approved by: https://github.com/leitian
2024-12-07 00:59:22 +00:00
c6e18a1ed1 [EZ] Remove unused binary_linux_build.sh (#142276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142276
Approved by: https://github.com/huydhn
2024-12-07 00:48:56 +00:00
716a06d22c Mark async-tp ops as needs_fixed_stride_order (#142252)
Inductor seems to not respect the input striding of these ops, which is required for fp8 async-TP and has performance implications in other cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142252
Approved by: https://github.com/weifengpy
2024-12-07 00:42:27 +00:00
be16dd678e [BE][MPS] Fix unused parameter warning
Before this change running `xcrun metal -c Indexing.metal -Wall -Wextra -fno-fast-math` resulted in 
```
Indexing.metal:75:10: warning: unused parameter 'thread_index' [-Wunused-parameter]
    uint thread_index [[thread_position_in_grid]]) {
         ^
1 warning generated.
```
After this change, no warnings are generated.

Also, remove redundant semicolons
2024-12-06 16:41:27 -08:00
a30bfab224 random dag (#142180)
Utils for creating random dags and generating code for nn modules based on such dags.

In particular, this was used to do fuzz testing for unflatten, where the random dags instructed the generation of calls, const accesses, and buffer mutations in a system of nn modules.

Example of generated test:
```python
    def test_unflatten_random_dag_const_preserving_3(self):
        class N2(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)

            def forward(self, x):
                return x + 1

        class N1(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)
                self.n2 = N2()

            def forward(self, x):
                x = x + self.n2.const
                x = self.n2(x + 1)
                return x + 1

        class N0(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)
                self.n1 = N1()

            def forward(self, x):
                x = x + self.n1.n2.const
                x = self.n1(x + 1)
                x = self.n1.n2(x + 1)
                return x + 1

        inp = (torch.ones(1),)
        eager = N0()(*inp)
        ep = torch.export.export(
            N0(),
            inp,
            strict=False,
            preserve_module_call_signature=(
                "n1",
                "n1.n2",
            ),
        )
        epm = ep.module()
        ufm = torch.export.unflatten(ep)
        assert torch.allclose(epm(*inp), eager)
        assert torch.allclose(ufm(*inp), eager)
```

Differential Revision: [D66838348](https://our.internmc.facebook.com/intern/diff/D66838348/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142180
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-12-07 00:39:43 +00:00
0052943bee [SymmetricMemory] reorganize the op registry (#140763)
- Separates the definition and implementation
- Removes the false pt2_compliant flags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140763
Approved by: https://github.com/weifengpy
2024-12-06 23:59:11 +00:00
6e203ae6de [REFACTOR] Implement AOTDispatchCompiler wrapper (#142205)
This implements a new AOTDispatchCompiler wrapper class, which is just a wrapper around a callable that returns an OutputCode. We can then use it in AOTDispatch to decide whether or not to use the cache: if fw_compiler, bw_compiler and inference_compiler are all AOTDispatchCompilers, then we enable caching.

This type is pretty close to _CompiledFxGraphCallable, except it's not allowed to take any kwargs. Not sure how to consolidate the two ideas together just yet: unfortunately, there's no way to properly annotate the types to make them related. But a lot of the time, the input to this function will be a partially applied _CompiledFxGraphCallable.

This allows the PR above this one to enable AOTAutogradCache everywhere, but not increase instruction count or enable cache on unit tests that use aot_eager or other non inductor compilers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142205
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
2024-12-06 23:23:20 +00:00
5663ad99e7 Fix per-sample xfails for NJT tests (#142243)
#140736 fixed some xfails, but these were not properly failing in CI due to #142157. This PR removes the xfails so we can land a fix to that issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142243
Approved by: https://github.com/huydhn
2024-12-06 22:39:35 +00:00
ceb94d6a7d add torchrec collectives to enforce global ordering (#141970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141970
Approved by: https://github.com/yf225
2024-12-06 22:38:54 +00:00
UV
7597ab6370 Corrected AMSGrad max equation in Adam and AdamW (#142051)
Fixes #142041
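
For reference, the standard AMSGrad running-max form the docs are meant to reflect (a hedged restatement in conventional notation; see the PR diff for the exact rendering used in the docs):

```latex
\hat{v}_t^{\max} = \max\!\left(\hat{v}_{t-1}^{\max},\, \hat{v}_t\right),
\qquad
\theta_t = \theta_{t-1} - \frac{\gamma\, \hat{m}_t}{\sqrt{\hat{v}_t^{\max}} + \epsilon}
```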

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142051
Approved by: https://github.com/janeyx99
2024-12-06 21:55:26 +00:00
424156c26c [ROCm] Update to AOTriton 0.8b (#140172)
Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:

1. Nestedtensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
    + The kernel should use top-left alignment, bottom right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
   However, users are strongly recommended to update to ROCM 6.2.4, notably for
   its firmware updates.

Related unit tests are enabled as well.

Notable related changes from AOTriton 0.8b:

1. AOTriton 0.8b moves the GPU kernel out of libaotriton.so to a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as GPU kernel compression algorithm for better compression ratio: aotriton0.8b (.so + aotriton.images take 350MB) compared to aotriton0.7b .so: 800MB
3. The compression cannot be disabled now, and `liblzma` is a hard run-time dependency.
    + Should not be a problem, since `lzma` is part of Python Standard Library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-06 21:45:18 +00:00
eqy
0a619a212f [CUDA] Cleanup per-process-memory-fraction in test_cuda.py tests (#140852)
Otherwise certain sequences of tests will fail with OOM e.g.,
```
# python test/test_cuda.py -k max_split_expandable -k test_assigning_back_deleter_fns_to_tensor --repeat 100
..
----------------------------------------------------------------------
Ran 2 tests in 0.311s

OK
E.
======================================================================
ERROR: test_assigning_back_deleter_fns_to_tensor (__main__.TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/pytorch/torch/testing/_internal/common_utils.py", line 3058, in wrapper
    method(*args, **kwargs)
  File "/workspace/pytorch/test/test_cuda.py", line 4320, in test_assigning_back_deleter_fns_to_tensor
    graph, outputs = cudagraphify(foo, [inp])
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/pytorch/test/test_cuda.py", line 4080, in cudagraphify
    fn(*inputs)
  File "/workspace/pytorch/test/test_cuda.py", line 4316, in foo
    int8_cuda(LARGE_BUFFER) + x,
    ~~~~~~~~~~~~~~~~~~~~~~~~^~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 31.30 GiB is free. Process 2916661 has 442.00 MiB memory in use. 120.00 MiB allowed; Of the allocated memory 52.00 MiB is allocated by PyTorch, and 6.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

To execute this test, run the following from the base repo dir:
    python test/test_cuda.py TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 2 tests in 0.136s

FAILED (errors=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140852
Approved by: https://github.com/Skylion007
2024-12-06 21:26:54 +00:00
660845a1aa [AOTI] Add deprecation warning for torch._export.aot_load (#142212)
Summary: Add a deprecation warning for torch._export.aot_load, and encourage users to move to the new torch._inductor.aoti_load_package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142212
Approved by: https://github.com/angelayi
2024-12-06 21:12:34 +00:00
f36cccba2e Revert "[Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)"
This reverts commit 80ca6dd892613fd4f1dee9040b8273ddeadb1c50.

Reverted https://github.com/pytorch/pytorch/pull/140864 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/140864#issuecomment-2524168602))
2024-12-06 21:03:06 +00:00
cyy
1fa27f6e82 [3/N] Avoid copy in std::get (#141843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141843
Approved by: https://github.com/Skylion007
2024-12-06 20:13:36 +00:00
add4a42ea2 Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)
Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140320
Approved by: https://github.com/eqy, https://github.com/jithunnair-amd, https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
2024-12-06 20:09:56 +00:00
37c4b19e4d make sure ukernel prod is everywhere XNNPACK is (#142086)
Just double checking that ukernel prod (which should be linked with XNNPACK) is in all the places XNNPACK is
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142086
Approved by: https://github.com/kirklandsign
2024-12-06 20:09:30 +00:00
18ef3a09cd Add option in data loader for out of order data (#141833)
Fixes #105203

Facing a similar problem to the linked issue, where variable sized input data can mean that a handful of slow-to-process samples holds up smaller, faster-to-process samples from being used. This also leads to lower GPU utilization. In certain cases, e.g. evaluation epochs, inference pipelines, or other cases where reproducibility isn't important, this can bring significant speed ups.

This PR adds an `allow_out_of_order` bool input to the `DataLoader` class, defaulting to `false`, which when set to `true` will return data from workers in whatever order it is ready/processed in, rather than in strict index order.
Instead of storing data that was returned out of order, it is passed directly to the main thread and the entry in `_task_info` is deleted. The main changes are to check that an entry in `_task_info` does exist, and to only increase `self._rcvd_idx` when the lowest remaining index gets returned.
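
A minimal usage sketch, assuming the flag name given in this description (`allow_out_of_order`); sample order is no longer deterministic when it is enabled:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class VariableCostDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # some samples are much slower to produce than others
        return torch.randn(1 + idx % 7)

loader = DataLoader(
    VariableCostDataset(),
    batch_size=None,            # per-sample loading, sizes vary
    num_workers=4,
    allow_out_of_order=True,    # flag name as described in this PR
)
for sample in loader:
    ...  # samples arrive as workers finish them, not in strict index order
```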

Two tests are added to test this for iterable type datasets and index type datasets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141833
Approved by: https://github.com/andrewkho
2024-12-06 19:55:58 +00:00
61a7c83c64 [Inductor] fix device error for NopKernelSchedulerNode (#141372)
This PR adds device guard support for NopKernelSchedulerNode, which may create a tensor. Prior to this PR, we did not codegen a device guard for NopKernelSchedulerNode, leading to errors.

Prior to the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # TODO: ERROR here. Should be cuda:1
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        breakpoint()
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

After the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # New: move into device guard
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

Fixes #141010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141372
Approved by: https://github.com/eellison
2024-12-06 19:27:50 +00:00
3fd51e079d Revert "[Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)"
This reverts commit 35752cb1ba8324a00b06d72ed388f6437c82c5e5.

Reverted https://github.com/pytorch/pytorch/pull/141759 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141759#issuecomment-2523983558))
2024-12-06 19:14:36 +00:00
db13bd9ac2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit b8eb4b56d8dbcd07570bec616f7ea58e9dd58fb4.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/atalman due to Break internal tests see errors like: csrc\inductor\aoti_torch\shim_common.cpp(481): error C2491: 'aoti_torch__embedding_bag': definition of dllimport function not allowed ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2523968128))
2024-12-06 19:04:04 +00:00
cf58de59d7 [cuBLASLt][Memtracker] Allocate temporary cuBLASLt workspaces using tensors rather than going to the caching allocator directly (#139442)
CC @zdevito @janeyx99

This isn't ideal, but cuBLASLt workspaces are not currently cached, so this additional untracked allocation will cause `test_cuda_tracker_equivalence` to fail with a large enough workspace size, e.g. `CUBLAS_LT_WORKSPACE_SIZE=32768`. One solution is to just use byte-tensors for the workspace instead of going directly to the caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139442
Approved by: https://github.com/Aidyn-A, https://github.com/albanD, https://github.com/janeyx99
2024-12-06 19:01:12 +00:00
b7b56576d8 Allow user to manually pass module.name associated with global in {add}_safe_global (#142153)
Fixes #142144

A global x is saved in the checkpoint as `GLOBAL x.__module__ x.__name__`. So, after allowlisting a global, it is expected to match any GLOBAL instruction of the form `GLOBAL x.__module__ x.__name__`. However, there are edge cases where, for the same API from the same module, the value of `__module__` changes between versions, which prevents users from allowlisting the global.

In this case, in numpy < 2.1

```
torch.save(np_array, "bla")
# checkpoint has GLOBAL "np.core.multiarray" "_reconstruct"
```
In np version 2.1

```
with safe_globals([np.core.multiarray._reconstruct]):
    torch.load("bla")
```
`np.core.multiarray._reconstruct.__module__` gives "np._core.multiarray" (note the extra `_` before `core`); see what was done [here](https://github.com/numpy/numpy/blob/main/numpy/core/multiarray.py).

Since the dictionary used to look up safe globals is keyed on `"{foo.__module__}.{foo.__name__}"`, the `__module__`/`__name__` pair will no longer match what is in the checkpoint, so "np.core.multiarray._reconstruct" can no longer be properly allowlisted (instead, "np._core.multiarray._reconstruct" is the key in the dict).

We allow `add_safe_globals`/`safe_globals` to optionally take tuples of `(global, "module.name" string)` to work around such odd edge cases.
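
A hedged sketch of the tuple form (the checkpoint filename and the exact module path string are assumptions for illustration; pass whatever string is actually recorded in your checkpoint):

```python
import numpy as np
import torch
from torch.serialization import safe_globals

# Allowlist _reconstruct under the module.name string recorded in an old (numpy < 2.1)
# checkpoint, even though _reconstruct.__module__ now reports the "_core" path.
with safe_globals([(np.core.multiarray._reconstruct,
                    "numpy.core.multiarray._reconstruct")]):
    obj = torch.load("old_numpy_checkpoint.pt", weights_only=True)
```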

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142153
Approved by: https://github.com/albanD
2024-12-06 18:56:39 +00:00
1a7da6e7e9 [export] Add test to enforce consistency between synced thrift and generated thrift from schema.py (#141989)
Summary:
In this diff we implement a way to ensure the internal thrift schema from cfgr (configerator/structs/caffe2/torch/export/schema.thrift) and the schema in OSS (torch/_export/serde/schema.thrift) are in sync, by adding a unittest to reflect on the type names and fields from each schema and compare them field by field.

When we detect new fields/types from torch/_export/serde/schema.thrift, there will be a test failure on trunk, and the error message prompts people to add the missing field/type to the thrift schema from cfgr, so that they are always in sync in practice.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_thrift_schema_in_sync

Differential Revision: D66716834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141989
Approved by: https://github.com/yiming0416
2024-12-06 18:42:20 +00:00
bab15df40a Revert "[FSDP2] Move to public torch.distributed.fsdp (#141868)"
This reverts commit 45583a5df907a7948693c047e5fe2c8349622069.

Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))
2024-12-06 18:38:12 +00:00
4af7aa5e64 Revert "E2E composability testing (#141398)"
This reverts commit ad93aa854d2d7837c917ae81cfb8f3bf05ee58c9.

Reverted https://github.com/pytorch/pytorch/pull/141398 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/141868, we can try rebase and reland this after ([comment](https://github.com/pytorch/pytorch/pull/141398#issuecomment-2523908998))
2024-12-06 18:28:51 +00:00
683ec42958 Revert "Unbreak dynamic shape minimal arrayref interface tests (#142091)"
This reverts commit 2bfc600644ed59332f9da7b94558b9c4c9562b0d.

Reverted https://github.com/pytorch/pytorch/pull/142091 on behalf of https://github.com/atalman due to Breaks internal changes ([comment](https://github.com/pytorch/pytorch/pull/142091#issuecomment-2523906048))
2024-12-06 18:25:54 +00:00
f2f95ba813 [dynamo] Remove workaround for functools.wraps in functorch (#142014)
This is no longer needed after #142000.

Fixes #123365.

D66838774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142014
Approved by: https://github.com/zou3519
ghstack dependencies: #142000
2024-12-06 17:34:59 +00:00
c0ffeab02f [dynamo] Simplify handling of functools.wraps (#142000)
Previously, when Dynamo encountered a `functools.wraps(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break on failure.

That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
   reconstructability isn't necessarily encapsulated in
   `VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
   it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
   #134731.
2. We don't always need to reconstruct the result of
   `functools.wraps(...)`, so in those cases we don't want to give up
   tracing because of an early `can_reconstruct` check. Instead we can just
   let it fall through and graph break in the actual `reconstruct` call
   later, if needed.

This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.

Fixes #134731, #141514.

D66838708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
2024-12-06 17:34:59 +00:00
5872a8c6b0 Use task submitter TLS in gloo working threads (#142184)
Fixes: #86830

CC: @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142184
Approved by: https://github.com/albanD
2024-12-06 17:03:17 +00:00
692b5e75ed [logging] Add triton_compile_time_us column to dynamo_compile (#142068)
Test Plan: See internal diff [D66799565](https://www.internalfb.com/diff/D66799565)

Differential Revision: [D66799565](https://our.internmc.facebook.com/intern/diff/D66799565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142068
Approved by: https://github.com/c00w
2024-12-06 16:11:57 +00:00
b64a537993 [CD] xpu nightly manylinux whl with cxx11-abi (#142210)
Follow https://github.com/pytorch/pytorch/issues/123649
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142210
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-12-06 15:10:47 +00:00
34033cce4d Enable concat support through inductor using pointwise kernels (#141966)
Summary: Add ability to always force pointwise kernel for concat codegen through Inductor.

Differential Revision: D66669372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141966
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2024-12-06 14:28:07 +00:00
661d1f0372 [aotd] non-contiguous NestedTensor mutation in compile (#139630)
Allow mutations for subclasses that are non-contiguous.

Changes:

Removing assert in collect_metadata_analysis

Main requested testcase:
Compilation of NJT.index_put()

Adding test in test_nestedtensor.py, that compiles NJT.index_put()

It is decomposed to NJT split/unbind, which needed additional `torch._check`/`torch._check_is_size` for NJT.unbind() and `guard_size_oblivious()` usage in _meta_registrations and _inductor/lowering.py.

Special case:
If a tangent is mutated outside of the graph, it does not participate in the backward graph; autograd in this case sets this tangent to a zeros tensor.

We handle it separately in CompiledFunction.backward: we do not do any processing for this tangent and broadcast it to the number of expected unwrapped subclass arguments.

Two tests are disabled for dynamo:
1. For nested tensor, there is a symbolic-shapes issue on the nested_tensor index operation that does splits [0, 0, 0]: it fails with "pending unbacked symints". This PR does not add more .tolist()/item() ops than there were before.

2. As we no longer fail with an exception in collect_metadata_analysis, new code paths for dynamo started working, and they started failing with something strange: the set_ handling for storage_offset (due to the view test) updates the storage device "cpu" -> "meta".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139630
Approved by: https://github.com/bdhirsh
2024-12-06 12:18:46 +00:00
c683839e6e [AOTI] Clean up temporary files generated by AOTI package loader. (#141773)
Fix #141772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141773
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2024-12-06 11:46:47 +00:00
c8c669ce74 When using a third-party device to test DeviceMesh, the error check for 'test_raises_invalid_device_type' can only print 'GPU' (#142038)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142038
Approved by: https://github.com/kwen2501
2024-12-06 11:14:00 +00:00
cc64ad659d Detect accelerator type when backend is not specified (#142216)
Today, when the user calls `init_process_group()` without specifying `backend` or `device_id`, we auto-translate it into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover whatever the user may do later.

A side effect is increased initialization time and resource usage.

This PR changes it to detect the accelerator type on the machine and initialize only the backend for that accelerator.
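
For illustration, a hedged sketch of the described behavior (no new API surface; run under torchrun so the rendezvous environment variables are set):

```python
import torch.distributed as dist

# No backend/device_id given: previously this implied "cuda:nccl,cpu:gloo" and initialized
# all default backends; with this PR only the backend for the detected accelerator
# (e.g. nccl on a CUDA machine, gloo on a CPU-only machine) is initialized.
dist.init_process_group()
```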

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142216
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-12-06 10:55:56 +00:00
cyy
5d3622447d Enable Wtype-limits (#142099)
Since it can detect underflow bugs of unsigned integers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142099
Approved by: https://github.com/ezyang
2024-12-06 08:14:18 +00:00
01d7644dc9 [dynamo] Undo some jvp old workarounds in functorch (#142082)
This basically undoes most of the workarounds introduced in #123118, the
root causes of which have been fixed by #142078.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142082
Approved by: https://github.com/zou3519
ghstack dependencies: #142078, #142080, #142081
2024-12-06 08:06:53 +00:00
9d54cd1504 [dynamo] Undo some jvp old workarounds in functorch (#142081)
This basically undoes some workarounds introduced in #119926, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.

Now that Dynamo traces the spec comparison code, the test also needs update:
- removing the `_jvp_treespec_compare` calls in fx graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142081
Approved by: https://github.com/zou3519
ghstack dependencies: #142078, #142080
2024-12-06 08:06:53 +00:00
59de5e867b [dynamo] Undo some vjp old workarounds in functorch (#142080)
This basically undoes most of the workarounds introduced in #119405, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.

Now that Dynamo traces the spec comparison code, the test also needs update:
1. renaming `o` to `primals_out`
2. removing the `_vjp_treespec_compare` calls in fx graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142080
Approved by: https://github.com/zou3519
ghstack dependencies: #142078
2024-12-06 08:06:53 +00:00
aab0f32ea4 [dynamo] Properly handle != under user-defined __eq__ (#142078)
Previously, Dynamo modelled `object.__ne__` as just a comparison over value
identity; however, in CPython the default `!=` dispatches to `__eq__`,
which might have been overridden by the user. This patch fixes the behavior
divergence.
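
For reference, a small pure-Python example of the CPython behavior being matched here:

```python
class AlwaysEqual:
    def __eq__(self, other):
        # user-defined __eq__ that ignores object identity
        return True

a, b = AlwaysEqual(), AlwaysEqual()
# CPython derives the default != from __eq__, so this is False
# even though a and b are distinct objects.
print(a != b)  # False
print(a is b)  # False
```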

Fixes #142055.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142078
Approved by: https://github.com/jansel, https://github.com/zou3519
2024-12-06 08:06:53 +00:00
c5cfc6a4c9 [pipelining] forward fix for _validate_schedule (#142211)
https://github.com/pytorch/pytorch/pull/142009 broke CSV loading since it can no longer handle schedules with `I` and `W`. This was caught in the torchtitan tests which loads a custom CSV file using `I` and `W` https://github.com/pytorch/torchtitan/actions/runs/12188167461/job/34000683921?pr=689.

Follow up would be to test a real custom schedule in PyTorch rather than torchtitan. The custom schedule in titan is here:  https://github.com/pytorch/torchtitan/blob/main/test/assets/custom_schedule.csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142211
Approved by: https://github.com/mori360
ghstack dependencies: #142009
2024-12-06 08:04:31 +00:00
eqy
8fc6d3a5d8 [SDPA] Allow user-specified priority order with context manager (#140467)
TODO: docs changes?
For better debuggability of issues like https://github.com/pytorch/pytorch/issues/139298

Better testing, current sketch:

``` Python
import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(64, 1024, 8, 64, dtype=torch.half, device='cuda')
print(torch._C._get_sdp_priority_order())

orders = [[SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]]
import time
times = list()
for order in orders:
    print(order)
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    times.append(t1 - t0)
print(times)
assert times[0] < times[1]
assert times[0] > times[2]
assert times[1] > times[2]
print(torch._C._get_sdp_priority_order())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140467
Approved by: https://github.com/drisspg
2024-12-06 07:56:35 +00:00
e7de245ee1 Revert "[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)"
This reverts commit 8bfc0094e468b0abefe087d671903a1ca738edf0.

Reverted https://github.com/pytorch/pytorch/pull/141085 on behalf of https://github.com/williamwen42 due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/141085#issuecomment-2522403360))
2024-12-06 07:50:10 +00:00
4a6c056466 Revert "[3/N] Avoid copy in std::get (#141843)"
This reverts commit 671e9c7aba2dc72b65391aa4bba1b9e079c2f1b2.

Reverted https://github.com/pytorch/pytorch/pull/141843 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of CUDA builds are failing in trunk after this lands due to OOM ([comment](https://github.com/pytorch/pytorch/pull/141843#issuecomment-2522335911))
2024-12-06 07:32:07 +00:00
8bdcdae733 [DTensor] Support matmul in inference_mode (#142197)
Fixes #142190 .

The solution is to add a `decompose_handler` for `aten.matmul`, similar to how we handle `aten.linear`.
With the decomposition, `aten.matmul` becomes `aten.mm`, which has a sharding strategy registered with DTensor.
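
A hedged repro-style sketch of the now-working case (run under `torchrun --nproc_per_node=2`; the mesh size and tensor shapes here are assumptions for illustration):

```python
# torchrun --nproc_per_node=2 matmul_inference_mode.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))
a = distribute_tensor(torch.randn(8, 16), mesh, [Shard(0)])
b = distribute_tensor(torch.randn(16, 32), mesh, [Shard(1)])

with torch.inference_mode():
    # aten.matmul is decomposed to aten.mm, which has a registered sharding strategy
    out = torch.matmul(a, b)
print(out.shape)
```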

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142197
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-12-06 07:15:05 +00:00
02c509669a Aoti minifier flatten (#141156)
Flatten the inputs to the minifier so the AOTI minifier can handle unflattened inputs and kwargs.

- Flatten the inputs in the minifier.
- Change the "load_and_run" part of the minifier verification to run on the flattened inputs.
- Refactor code to keep `torch._inductor.__init__.py` clean.
- Update the docs.

`python test/inductor/test_minifier.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141156
Approved by: https://github.com/desertfire
2024-12-06 07:12:45 +00:00
23e2f8ab3a [Inductor] add flag for linear binary folding and turn it off by default (#142108)
Fix https://github.com/pytorch/pytorch/issues/141755.

Summary:
Linear binary folding results in an accuracy regression for the timm model levit_128, so this PR adds the flag `enable_linear_binary_folding` for linear binary folding and turns it off by default.
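
A hedged sketch of opting back in (the flag name comes from this PR; its exact location under the inductor config module is an assumption):

```python
import torch
import torch._inductor.config as inductor_config

# Off by default after this PR; enable explicitly if linear + binary folding is wanted.
inductor_config.enable_linear_binary_folding = True

model = torch.nn.Linear(8, 8).eval()
compiled = torch.compile(model)
```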

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142108
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 07:12:29 +00:00
67ba79676f [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [7/N] (#140922)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140922
Approved by: https://github.com/williamwen42
2024-12-06 07:07:29 +00:00
52b7f0ba12 [DTensor] fix stride of fake tensor produced by shard_dim_alltoall (#141835)
Currently, DTensor redistributions involving all2all `Shard(n) -> Shard(m)` generate faulty inductor code when compiled:
```python
# torchrun --nproc_per_node=2 crash.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, DTensor

mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',))
dt = DTensor.from_local(torch.randn(2, 4, device='cuda'), mesh, [Shard(0)]).requires_grad_()
def f(dt): return dt.redistribute(placements=[Shard(1)]).to_local()
f(dt).sum().backward() # no crash
f = torch.compile(f)
f(dt).sum().backward() # crash
```
resulting:
```python
[rank1]: Traceback (most recent call last):
[rank1]:   File "/crash.py", line 11, in <module>
[rank1]:     f(dt).sum().backward() # crash
[rank1]:     ^^^^^
...
[rank1]:   File "/tmp/torchinductor_main/gu/cgurkeb7tzx7kfsnooolsjefrgoizzylrldrugc52n4avmgiccas.py", line 41, in call
[rank1]:     assert_size_stride(buf0, (4, 2), (4, 1))
[rank1]: AssertionError: expected size 4==4, stride 2==4 at dim=0
```

This happens because the current [`register_fake` implementation for `shard_dim_alltoall` ops](5deca07c0d/torch/distributed/tensor/_collective_utils.py (L32)) returns an erroneous stride:

```python
import torch
import torch.distributed as dist
from torch._C._distributed_c10d import _register_process_group
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor._collective_utils import _shard_dim_alltoall_meta, _get_group_size_by_name

mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',))
_register_process_group('ep', mesh['ep'].get_group())
x = torch.randn(2, 4, device='meta')
y = _shard_dim_alltoall_meta(x, 0, 1, 'ep')
if dist.get_rank() == 0:
    print(x.shape, x.stride()) # torch.Size([2, 4]) (4, 1)
    print(y.shape, y.stride()) # torch.Size([4, 2]) (4, 1)
```

---

The proposed fix in the pull request makes the provided example code compile correctly and stop erroring. However, I know very little about torch internals and expect there may be something wrong with this patch. Any corrections are appreciated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141835
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-12-06 06:56:03 +00:00
35752cb1ba [Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)
Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression of these two timm_models is caused by Conv/Linear + broadcast add fusion run into oneDNN ref path. This PR constrains the shape of other tensor for Conv/Linear + broadcast add fusion to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 06:20:41 +00:00
b8eb4b56d8 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-06 04:54:42 +00:00
20f24e3fbd [inductor][cpp] Add BMM kernel template for autotuning (#129772)
This PR adds the Cpp template for BMM, for FP32, FP16, and BF16. See #125683 for more background.

1.  Adds `CppBmmTemplate` class which inherits from `CppPackedGemmTemplate`. Given a number of worker threads `num_threads` and batch size `B`, execute the Gemm kernel. For the first `B - (B % num_threads)` batch inputs, run one sub-gemm problem per thread. Then for the remaining `B % num_threads` sub-gemms, we execute each subproblem using the parallelized Gemm kernel.
To manage this code, the `GEMM_TEMPLATE` from `CppPackedGemmTemplate` is rendered two different times, one with a single thread and one which includes the parallel OMP pragma.
2. Adapts `CppPackedGemmTemplate` to allow for child class. The `GEMM_TEMPLATE` is separated into different strings to allow for rendering by the child class. Slicing/indexing are adapted to allow for 3D BMM inputs. Additional methods `get_options()` and `_get_params_for_choices()` are added to reduce code duplication.

BMM within the `dlrm` benchmark has a single input buffer which is used for both X and W inputs. This is currently not supported in this PR.

### Performance
On Granite/Sapphire Rapids, cpp_bmm template code uses AMX which requires an expensive transpose operation so the BMM op is rarely selected as faster than the existing external bmm kernel. As a result, speedup on SPR is identical with and without BMM code. Pass rate matches the rates for main exactly.

#### Test Summary on Granite Rapids
| Test Scenario | Comp Item | Config | Compiler | torchbench | huggingface | timm_models |
| -- | -- | -- | -- | -- | -- | -- |
| Single Socket Multi-Threads | Pass Rate | gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61 |
| Single Socket Multi-Threads | Pass Rate | bmm + gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61 |
| Single Socket Multi-Threads | Geomean Speedup | gemm autotune | inductor | 2.15x | 1.91x | 2.52x |
| Single Socket Multi-Threads | Geomean Speedup | bmm + gemm autotune | inductor | 2.15x | 1.96x | 2.53x |
| Single Core Single-Thread | Pass Rate | gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61 |
| Single Core Single-Thread | Pass Rate | bmm + gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61 |
| Single Core Single-Thread | Geomean Speedup | inductor_locally_benchmark_586 | inductor | 2.43x | 1.56x | 2.60x |
| Single Core Single-Thread | Geomean Speedup | inductor_locally_benchmark_585 | inductor | 2.45x | 1.56x | 2.63x |

This is not the case on an older Skylake Xeon machine.
For the BMM ops contained in torchbench models, bmm performance improves by 1.10-2.64x.

#### BF16 28-core Skylake Xeon
| Model | Inductor | GemmAutotune | Gemm+BMM Autotune |
|--------|--------|--------|--------|
| BERT_pytorch | 1.233x | 2.597x | 2.608x |
| hf_DistilBert | 1.128x | 2.242x | 2.368x |
| hf_Reformer | 1.124x | 1.419x | 1.590x |
| hf_T5_base | 1.012x | 1.257x | 1.382x |
| hf_T5_large | 1.085x | 2.228x | 2.345x |

## Example BMM Code
```
#include <c10/util/Unroll.h>
#include <torch/csrc/inductor/aoti_torch/c/shim.h>

template <bool accum>
inline void cpp_bmm_micro_gemm_amx_kernel_32_2(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc,
    uint8_t tilecfg_rows
) {
    // TODO(jgong5): add prefetch hint for A, B, C
    auto loadconfig = [](const amx_tilecfg& cfg) {
        _tile_loadconfig(&cfg);
    };
    const auto last_k_offset = K / 32 * 32;
    const auto tail_k_size = K - last_k_offset;
    if C10_LIKELY (last_k_offset > 0) {
        amx_state.configure(tilecfg_rows, 64, 32 / 16, 2, loadconfig);
    } else {
        amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 32 / 16, 2, loadconfig);
    }
    auto load_c = [&]() {
        _tile_loadd(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(1, C + 0 * ldc + 16, ldc * sizeof(float));
        _tile_loadd(2, C + 16 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(3, C + 16 * ldc + 16, ldc * sizeof(float));
    };
    auto zero_c = [&]() {
        _tile_zero(0);
        _tile_zero(1);
        _tile_zero(2);
        _tile_zero(3);
    };

    if constexpr (accum) {
        load_c();
    } else {
        zero_c();
    }

    auto compute = [&](int k) {
        _tile_stream_loadd(4, A + 0 * lda + k, lda * sizeof(bfloat16));
        _tile_loadd(6, B + k * ldb + 0, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(0, 4, 6);
        _tile_loadd(7, B + k * ldb + 32, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(1, 4, 7);
        _tile_stream_loadd(5, A + 16 * lda + k, lda * sizeof(bfloat16));
        _tile_dpbf16ps(2, 5, 6);
        _tile_dpbf16ps(3, 5, 7);
    };

    #pragma GCC unroll 4
    for (int k = 0; k < last_k_offset; k += 32) {
        compute(k);
    }

    auto store_c = [&]() {
    // store to C
        _tile_stored(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_stored(1, C + 0 * ldc + 16, ldc * sizeof(float));
        _tile_stored(2, C + 16 * ldc + 0, ldc * sizeof(float));
        _tile_stored(3, C + 16 * ldc + 16, ldc * sizeof(float));
    };

    // TODO(jgong5): move tail k computation to separate loopnest to save tile configuration overhead
    if C10_UNLIKELY (tail_k_size > 0) {
        if C10_LIKELY (last_k_offset > 0) {
            store_c();
            amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 32 / 16, 2, loadconfig);
            load_c();
        }
        compute(last_k_offset);
    }

    store_c();
}

template <bool accum>
inline void cpp_bmm_micro_gemm_amx_kernel_16_2(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc,
    uint8_t tilecfg_rows
) {
    // TODO(jgong5): add prefetch hint for A, B, C
    auto loadconfig = [](const amx_tilecfg& cfg) {
        _tile_loadconfig(&cfg);
    };
    const auto last_k_offset = K / 32 * 32;
    const auto tail_k_size = K - last_k_offset;
    if C10_LIKELY (last_k_offset > 0) {
        amx_state.configure(tilecfg_rows, 64, 16 / 16, 2, loadconfig);
    } else {
        amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 16 / 16, 2, loadconfig);
    }
    auto load_c = [&]() {
        _tile_loadd(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(1, C + 0 * ldc + 16, ldc * sizeof(float));
    };
    auto zero_c = [&]() {
        _tile_zero(0);
        _tile_zero(1);
    };

    if constexpr (accum) {
        load_c();
    } else {
        zero_c();
    }

    auto compute = [&](int k) {
        _tile_stream_loadd(2, A + 0 * lda + k, lda * sizeof(bfloat16));
        _tile_loadd(3, B + k * ldb + 0, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(0, 2, 3);
        _tile_loadd(4, B + k * ldb + 32, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(1, 2, 4);
    };

    #pragma GCC unroll 4
    for (int k = 0; k < last_k_offset; k += 32) {
        compute(k);
    }

    auto store_c = [&]() {
    // store to C
        _tile_stored(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_stored(1, C + 0 * ldc + 16, ldc * sizeof(float));
    };

    // TODO(jgong5): move tail k computation to separate loopnest to save tile configuration overhead
    if C10_UNLIKELY (tail_k_size > 0) {
        if C10_LIKELY (last_k_offset > 0) {
            store_c();
            amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 16 / 16, 2, loadconfig);
            load_c();
        }
        compute(last_k_offset);
    }

    store_c();
}

template <bool accum>
inline void cpp_bmm_micro_gemm(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    AOTI_TORCH_CHECK(N % 32 == 0, "N dimension must be multiple of 32");
    AOTI_TORCH_CHECK(K % 2 == 0, "K dimension must be multiple of 2");
    // TODO(jgong5): loop unroll for M and N
    for (int64_t n = 0; n < N; n += 32) {
        for (int64_t m = 0; m < M; m += 32) {
            int64_t block_m = std::min<int64_t>(M - m, 32);
            int64_t m_tail = m;
            if (block_m >= 32) {
                cpp_bmm_micro_gemm_amx_kernel_32_2<accum>(
                    amx_state,
                    A + m * lda,
                    B + n,
                    C + m * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    16
                );
                block_m -= 32;
                m_tail += 32;
            }
            else
            if (block_m >= 16) {
                cpp_bmm_micro_gemm_amx_kernel_16_2<accum>(
                    amx_state,
                    A + m * lda,
                    B + n,
                    C + m * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    16
                );
                block_m -= 16;
                m_tail += 16;
            }
            if (block_m > 0) {
                cpp_bmm_micro_gemm_amx_kernel_16_2<accum>(
                    amx_state,
                    A + m_tail * lda,
                    B + n,
                    C + m_tail * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    block_m
                );
            }
        }
    }
}
void threaded_mm(const bfloat16* X, const bfloat16* W, bfloat16* Y, const int64_t ks_b_index)
{

    constexpr int64_t num_threads = 48;
    constexpr int64_t N = 64;
    constexpr int64_t K = 96;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
    constexpr int64_t Nr_blocks = (N + Nr - 1) / Nr;
    constexpr int64_t Kr_blocks = (K + Kr - 1) / Kr;
    constexpr int64_t M = static_cast<int64_t>(384L);
    constexpr int64_t Mr_blocks = (M + Mr - 1) / Mr;
    constexpr int64_t Mt_blocks = 1;
    constexpr int64_t Nt_blocks = 1;
    constexpr int64_t Kt_blocks = 3;
    constexpr int64_t Mc_blocks = 1;
    constexpr int64_t Nc_blocks = 1;
    constexpr int64_t Kc_blocks = 3;
    constexpr int64_t num_Mc_blocks = (Mr_blocks + Mc_blocks - 1) / Mc_blocks;
    constexpr int64_t num_Nc_blocks = (Nr_blocks + Nc_blocks - 1) / Nc_blocks;
    constexpr int64_t num_Mt_blocks = (Mr_blocks + Mt_blocks - 1) / Mt_blocks;
    constexpr int64_t num_Nt_blocks = (Nr_blocks + Nt_blocks - 1) / Nt_blocks;
    constexpr int64_t num_Kt_blocks = (Kr_blocks + Kt_blocks - 1) / Kt_blocks;

    // make sure all partitions are assigned
    AOTI_TORCH_CHECK(
        Mt_blocks * Nt_blocks * Kt_blocks * 48 >= Mr_blocks * Nr_blocks * Kr_blocks,
        "Not all partitions are assigned."
    );
    #pragma omp parallel num_threads(48)
    {
        const int tid = omp_get_thread_num();
        const int64_t k_group_id = tid / num_Kt_blocks;
        const int64_t k_slice_id = tid % num_Kt_blocks;
        const int64_t n_group_id = k_group_id / num_Nt_blocks;
        const int64_t n_slice_id = k_group_id % num_Nt_blocks;
        const int64_t k_block_start = k_slice_id * Kt_blocks;
        const int64_t k_block_end = std::min(k_block_start + Kt_blocks, Kr_blocks);
        const int64_t n_block_start = n_slice_id * Nt_blocks;
        const int64_t n_block_end = std::min(n_block_start + Nt_blocks, Nr_blocks);
        const int64_t m_block_start = std::min(n_group_id * Mt_blocks, Mr_blocks);
        const int64_t m_block_end = std::min(m_block_start + Mt_blocks, Mr_blocks);
        const int64_t num_Mc_blocks_per_thread = (m_block_end - m_block_start + Mc_blocks - 1) / Mc_blocks;
        AMXState amx_state;
        auto _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); auto local_acc_buf = _local_acc_buf.get();
        for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; mc_block_id++) {
            const int64_t my_mc_block_id = (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
            const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
            const int64_t m_start = mc * Mr;
            const int64_t m_end = std::min(std::min(mc + Mc_blocks, m_block_end) * Mr, M);
            const int64_t m_size = m_end - m_start;
            for (int64_t nc = n_block_start; nc < n_block_end; nc += Nc_blocks) {
                const int64_t n_start = nc * Nr;
                const int64_t n_end = std::min(std::min(nc + Nc_blocks, n_block_end) * Nr, N);
                const int64_t n_size = n_end - n_start;
                // NB: assume we pad N, nc_block_end won't exceed padded N here.
                const int64_t nc_block_end = std::min(nc + Nc_blocks, n_block_end);
                if (_local_acc_buf == nullptr) { _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); local_acc_buf = _local_acc_buf.get(); }
                for (int64_t kc = k_block_start; kc < k_block_end; kc += Kc_blocks) {
                    int64_t k_start = kc * Kr;
                    int64_t k_end = std::min(std::min(kc + Kc_blocks, k_block_end) * Kr, K);
                    for (int64_t nci = nc; nci < nc_block_end; nci++) {
                        if (kc == k_block_start) {
                            cpp_bmm_micro_gemm<static_cast<bool>(false)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        } else {
                            cpp_bmm_micro_gemm<static_cast<bool>(true)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        }
                    }
                }
                {
                    {
                        #pragma GCC ivdep
                        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(m_end + ((-1L)*m_start)); x0+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(16));
                            }
                            for(int64_t x1=static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1<static_cast<int64_t>(n_end + ((-1L)*n_start)); x1+=(static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))) == 0 ? 1 : static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))))))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                            }
                        }
                    }

                }
            }
        }
        amx_state.release([]() { _tile_release(); });
    }
}
void single_thread_mm(const bfloat16* X, const bfloat16* W, bfloat16* Y, const int64_t ks_b_index)
{

    constexpr int64_t num_threads = 1;
    constexpr int64_t N = 64;
    constexpr int64_t K = 96;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
    constexpr int64_t Nr_blocks = (N + Nr - 1) / Nr;
    constexpr int64_t Kr_blocks = (K + Kr - 1) / Kr;
    constexpr int64_t M = static_cast<int64_t>(384L);
    constexpr int64_t Mr_blocks = (M + Mr - 1) / Mr;
    constexpr int64_t Mt_blocks = 12;
    constexpr int64_t Nt_blocks = 2;
    constexpr int64_t Kt_blocks = 3;
    constexpr int64_t Mc_blocks = 12;
    constexpr int64_t Nc_blocks = 1;
    constexpr int64_t Kc_blocks = 3;
    constexpr int64_t num_Mc_blocks = (Mr_blocks + Mc_blocks - 1) / Mc_blocks;
    constexpr int64_t num_Nc_blocks = (Nr_blocks + Nc_blocks - 1) / Nc_blocks;
    constexpr int64_t num_Mt_blocks = (Mr_blocks + Mt_blocks - 1) / Mt_blocks;
    constexpr int64_t num_Nt_blocks = (Nr_blocks + Nt_blocks - 1) / Nt_blocks;
    constexpr int64_t num_Kt_blocks = (Kr_blocks + Kt_blocks - 1) / Kt_blocks;

    // make sure all partitions are assigned
    AOTI_TORCH_CHECK(
        Mt_blocks * Nt_blocks * Kt_blocks * 1 >= Mr_blocks * Nr_blocks * Kr_blocks,
        "Not all partitions are assigned."
    );
    {
        constexpr int tid = 0;
        constexpr int64_t k_group_id = 0;
        constexpr int64_t k_slice_id = 0;
        constexpr int64_t n_group_id = 0;
        constexpr int64_t n_slice_id = 0;
        constexpr int64_t m_block_start = 0;
        constexpr int64_t n_block_start = 0;
        constexpr int64_t n_block_end = Nr_blocks;
        constexpr int64_t k_block_start = 0;
        constexpr int64_t k_block_end = Kr_blocks;
        constexpr int64_t num_Mc_blocks_per_thread = num_Mc_blocks;
        constexpr int64_t m_block_end = Mr_blocks;
        AMXState amx_state;
        auto _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); auto local_acc_buf = _local_acc_buf.get();
        for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; mc_block_id++) {
            const int64_t my_mc_block_id = (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
            const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
            const int64_t m_start = mc * Mr;
            const int64_t m_end = std::min(std::min(mc + Mc_blocks, m_block_end) * Mr, M);
            const int64_t m_size = m_end - m_start;
            for (int64_t nc = n_block_start; nc < n_block_end; nc += Nc_blocks) {
                const int64_t n_start = nc * Nr;
                const int64_t n_end = std::min(std::min(nc + Nc_blocks, n_block_end) * Nr, N);
                const int64_t n_size = n_end - n_start;
                // NB: assume we pad N, nc_block_end won't exceed padded N here.
                const int64_t nc_block_end = std::min(nc + Nc_blocks, n_block_end);
                if (_local_acc_buf == nullptr) { _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); local_acc_buf = _local_acc_buf.get(); }
                for (int64_t kc = k_block_start; kc < k_block_end; kc += Kc_blocks) {
                    int64_t k_start = kc * Kr;
                    int64_t k_end = std::min(std::min(kc + Kc_blocks, k_block_end) * Kr, K);
                    for (int64_t nci = nc; nci < nc_block_end; nci++) {
                        if (kc == k_block_start) {
                            cpp_bmm_micro_gemm<static_cast<bool>(false)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        } else {
                            cpp_bmm_micro_gemm<static_cast<bool>(true)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        }
                    }
                }
                {
                    {
                        #pragma GCC ivdep
                        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(m_end + ((-1L)*m_start)); x0+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(16));
                            }
                            for(int64_t x1=static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1<static_cast<int64_t>(n_end + ((-1L)*n_start)); x1+=(static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))) == 0 ? 1 : static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))))))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                            }
                        }
                    }

                }
            }
        }
        amx_state.release([]() { _tile_release(); });
    }
}
extern "C"
void cpp_bmm(const bfloat16* X, const bfloat16* W, bfloat16* Y)
{
    const int64_t B = static_cast<int64_t>(5L);
    constexpr int64_t num_threads = 48;
    int64_t B_single_thread_block = (B / num_threads) * num_threads;

    #pragma omp parallel for num_threads(48)
    for (int64_t b_start = 0; b_start < B_single_thread_block; ++b_start) {
        single_thread_mm(X, W, Y, b_start);
    }
    for (int64_t b_start = B_single_thread_block; b_start < B; ++b_start) {
        threaded_mm(X, W, Y, b_start);
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129772
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 04:54:00 +00:00
39425feac7 Filter pattern matching tests based on ACL (#141921)
There are a number of cases where pattern matching differs based on the presence of ACL, causing the tests to fail. This adds `TEST_ACL` and `skipIfACL` so that these tests can still run with different values or be entirely skipped if necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141921
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-06 04:19:41 +00:00
cyy
671e9c7aba [3/N] Avoid copy in std::get (#141843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141843
Approved by: https://github.com/Skylion007
2024-12-06 04:15:31 +00:00
cyy
4bc8de334f Remove __ubsan_ignore_undefined__ in some cases (#142120)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142120
Approved by: https://github.com/ezyang
2024-12-06 04:13:57 +00:00
646024e823 Add convnext_base to higher tolerance (#142159)
See https://github.com/pytorch/pytorch/issues/141498 https://github.com/pytorch/pytorch/issues/141703

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142159
Approved by: https://github.com/bertmaher, https://github.com/huydhn
2024-12-06 04:00:13 +00:00
80ca6dd892 [Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.

This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
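
A heavily simplified, illustrative sketch of the decorator's idea (not the actual Inductor implementation; the emitted-string form and argument handling are assumptions):

```python
import functools

def maybe_upcast_float32(convert_output=True):
    # Wrap a string-emitting codegen op so low-precision operands are computed in fp32
    # and the result is cast back to the source dtype.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, src_dtype="tl.float16"):
            upcast_args = [f"{a}.to(tl.float32)" for a in args]
            out = fn(*upcast_args)
            if convert_output:
                # ops returning booleans would skip this downcast
                out = f"({out}).to({src_dtype})"
            return out
        return wrapper
    return decorator

@maybe_upcast_float32()
def cos(x):
    return f"tl_math.cos({x})"

print(cos("tmp0"))  # (tl_math.cos(tmp0.to(tl.float32))).to(tl.float16)
```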

# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.

Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.

This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-12-06 03:15:20 +00:00
0602676c8d [CUTLASS][AOTI] Fixes undefined symbol: cudaLaunchKernelExC (#142094)
Summary:
### Context
* When compiling the object file for a CUTLASS kernel, CUDA RT symbols are left undefined.
* When compiling the final shared object file, we statically link with `libcudart_static.a`.
* One important thing is that ordering matters when specifying the lib search paths (-L).

Test Plan:
```
// before diff
RuntimeError: Failure loading .so: /tmp/tmpqhz_dnza/model.so: undefined symbol: cudaLaunchKernelExC
```

Differential Revision: D66793974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142094
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-12-06 02:18:54 +00:00
8bfc0094e4 [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-06 01:49:55 +00:00
ce22a01e11 Add an option for classic search (#142018)
Fixes https://github.com/pytorch/tutorials/issues/3143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142018
Approved by: https://github.com/albanD
2024-12-06 01:24:52 +00:00
e803a3d83a Fix reductions for NJTs with ragged_idx != 1 (#142173)
**Background:** conversion from outer dim -> inner dim makes the (previously valid) assumption that the ragged dim is immediately next to the batch dim. This is no longer the case after #137125.

This PR:
* Updates the outer dim -> inner dim conversion logic to match the actual ragged_idx. Since ragged_idx tells us where the packed ragged / batch dim is, both ragged and batch outer dims should map to this inner dim. The conversion logic must now take in `ragged_idx` to make this possible, so the PR updates all call-sites to pass this.
* Fixes outputs across keepdim settings when reducing over ragged / batch dims.
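
A hedged sketch of the kind of case covered (the shapes and the use of transpose to obtain ragged_idx != 1 are assumptions for illustration, relying on the support added in #137125):

```python
import torch

# NJT of shape (B, j1, D); ragged_idx == 1 right after construction
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)

# Transposing moves the ragged dim so it is no longer adjacent to the batch dim
nt_t = nt.transpose(1, 2)   # shape (B, D, j1), ragged_idx != 1

# Reducing over the ragged dim previously mapped to the wrong inner dim
out = nt_t.sum(dim=-1)
print(out.shape)            # expected (B, D)
```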
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142173
Approved by: https://github.com/drisspg
2024-12-06 01:23:17 +00:00
6b0df2f720 [torch.func] expand stack_module_state's typing (#142169)
Summary:
https://github.com/pytorch/pytorch/pull/141894 made this API actually
typed w.r.t. pyre, which is causing some internal type failures. This PR
expands the typing for stack_module_state to squash those failures.

Test Plan:
- pyre
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142169
Approved by: https://github.com/albanD
2024-12-06 01:08:53 +00:00
93214aad30 [ROCM] Fix unit test: matmul_small_brute_force_tunableop (#142089)
Fixes #141636
Fixes #141635
Fixes #141458

Changes include:

- TunableOp filename that wasn't set properly
- Activate numerical check (see additional test comment)
- Entire test in try-finally clause to avoid OS environment variable leakage (see additional test comment)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142089
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-06 00:58:10 +00:00
ad93aa854d E2E composability testing (#141398)
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` to test_pp_composability.
Currently provides @parametrize on:
- "ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
- "MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]

Future work:
1. add fp8
2. add cp(context parallelism) to enable 4D test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-06 00:53:22 +00:00
461bd2c2f7 Update nested tensor warning to recommend layout=torch.jagged (#142140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142140
Approved by: https://github.com/YuqingJ
2024-12-06 00:40:30 +00:00
90052a8ae2 Revert "[ROCm] unskip hermite_polynomial_h unit tests (#141150)"
This reverts commit 69f8b3e269641fae93ed7afba49b6df8e44ed3c9.

Reverted https://github.com/pytorch/pytorch/pull/141150 on behalf of https://github.com/jeffdaily due to this PR is tied to #141955 and that one was reverted so need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/141150#issuecomment-2521830067))
2024-12-06 00:39:56 +00:00
efab8c433f [subclass] Fix unwrap_subclass_parameters parametrization (#142155)
Parametrization cannot be registered for non-direct child parameters of the module.
We have to iterate through all submodules and register the parametrization at every level.

The original test case did not cover the nested-modules case, so a submodule is added to the test.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142155
Approved by: https://github.com/bdhirsh
2024-12-05 23:53:36 +00:00
2bfc600644 Unbreak dynamic shape minimal arrayref interface tests (#142091)
Simple bug got introduced somewhere.

Differential Revision: [D66792420](https://our.internmc.facebook.com/intern/diff/D66792420/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142091
Approved by: https://github.com/desertfire, https://github.com/hl475
2024-12-05 23:26:35 +00:00
ca7be75e0a Reduce the nproc when building FA on 8.9 (#142164)
The newly introduced sm89 build is failing consistently in trunk now because of OOM https://github.com/pytorch/pytorch/actions/runs/12186328178/job/33994606556.  I suspect that FlashAttention is the cause.

### Testing

CI to see if the build works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142164
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-05 23:23:40 +00:00
4981bd8355 Make cache keys consistent between OSS and internal (#142147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142147
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-05 22:29:07 +00:00
a9e3281e94 [rfc][be] static assert that nccl version is >= 2.4 (#142023)
Summary:
Statically assert that the NCCL version is at least 2.4.
This is in preparation for enabling error checking by default in the PyTorch library and removing some macros.
That work is in PR #141914.
The rationale behind this version is:
1. 2.4 was released ~2 years ago, so it's unlikely that someone is still using an older library.
2. Enabling error checking is beneficial to the community, as it helps debug subtle bugs in production environments.

Test Plan: unit tests

Differential Revision: D66737055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142023
Approved by: https://github.com/kwen2501
2024-12-05 22:11:14 +00:00
5513e2ec35 [SymmetricMemory] use the python version of empty() and rendezvous() for tests and library ops (#142154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142154
Approved by: https://github.com/weifengpy
2024-12-05 22:09:36 +00:00
16ea0ddcdb Ignore logger methods to avoid graph breaks (#139403)
Fixes #132635

Calls to logging.Logger methods cause a graph break; this PR allows the user to avoid these graph breaks (for specific methods) by setting DISABLE_LOGS_WHILE_COMPILING to 1.
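A hedged usage sketch; the environment variable name below is taken from this description and the exact knob may differ in the merged version:

```python
import os
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"  # assumed knob from the description above

import logging
import torch

log = logging.getLogger(__name__)

@torch.compile
def f(x):
    log.info("inside compiled region")  # would otherwise introduce a graph break
    return x.sin()

f(torch.randn(4))
```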

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139403
Approved by: https://github.com/williamwen42
2024-12-05 20:12:26 +00:00
41952c1876 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 38e0f72274cfad88e0f2ca40f27c79cd49413f5e.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/malfet due to This broke sm89 builds ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2521290457))
2024-12-05 20:07:29 +00:00
39482907be [AOTI] Refactor codegen_inputs in wrapper codegen (#141965)
Summary: Fork codegen_inputs for CppWrapperCodegen, because the behavior between python and cpp needs to diverge. On the python side, input backed symbols need to be generated for the autotune block. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66718225](https://our.internmc.facebook.com/intern/diff/D66718225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141965
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387, #141979
2024-12-05 19:49:34 +00:00
2fd8a7be71 [AOTI] Refactor additional_files generation (#141979)
Summary: https://github.com/pytorch/pytorch/pull/140675 adds logic to collect all the generated cubin file paths into an additional_files list, but the collection should only happen when DeferredGpuKernelLine is materialized. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66718227](https://our.internmc.facebook.com/intern/diff/D66718227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141979
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387
2024-12-05 19:49:02 +00:00
7e49da6077 DLPack: add support to PyTorch/XLA (#138470)
Taking over: #128176.

In summary, this PR:

- `__dlpack__`: Calls PyTorch/XLA `to_dlpack` function, if the tensor lives in an XLA:CUDA device
- `__dlpack_device__`: Correctly maps PyTorch/XLA tensors to `kDLGPU`, if XLA:CUDA is being used

The tests are introduced in https://github.com/pytorch/xla/pull/7213.
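A hedged sketch of the round-trip this enables (requires a torch_xla build with XLA:CUDA; the import path and device handling are assumptions, not code from the PR):

```python
import torch
import torch_xla.core.xla_model as xm  # assumes torch_xla is installed

t = torch.randn(4, device=xm.xla_device())  # XLA:CUDA tensor
dev_type, dev_id = t.__dlpack_device__()    # maps to kDLGPU when XLA:CUDA backs the tensor
capsule = t.__dlpack__()                    # delegates to torch_xla's to_dlpack
cuda_t = torch.from_dlpack(capsule)         # consume the capsule on the PyTorch side
```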
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138470
Approved by: https://github.com/albanD

Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
2024-12-05 19:36:36 +00:00
5f28c42746 [AOTI] Remove several overloaded members from WrapperCodegen (#141387)
Summary: Remove several overloaded string members from the WrapperCodegen classes, including open_bracket, closed_bracket, size, and stride. Instead of relying on polymorphism, we explicitly generate different strings for PythonWrapperCodegen and CppWrapperCodegen. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66459991](https://our.internmc.facebook.com/intern/diff/D66459991)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141387
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388
2024-12-05 19:29:38 +00:00
4cc0fc2707 [AOTI] Remove WrapperCodegen.expr_printer (#141388)
Summary: Avoid using expr_printer as an overridden class member for WrapperCodegen. Instead, use pexpr and cexpr explicitly for Python and C++ expression printing, respectively. This is to prepare for one-pass AOTI CUDA codegen, where PythonWrapperCodegen is used to generate the autotune block and CppWrapperCodegen is used to generate the model code.

Differential Revision: [D66459992](https://our.internmc.facebook.com/intern/diff/D66459992)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141388
Approved by: https://github.com/chenyang78
2024-12-05 19:20:39 +00:00
12b8c2fd8b Remove lintrunner windows exclusion (#142150)
Lintrunner is now available for Windows; see https://pypi.org/project/lintrunner/0.12.7/#files
This also means we no longer ask developers to install Rust on that platform.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142150
Approved by: https://github.com/wdvr
2024-12-05 19:02:21 +00:00
a9d84875a9 Fix mha torch._check in jit tracing (#142059)
Test Plan: `buck2 run @//mode/dev-nosan //mobile-vision/d2go/projects_oss/detr:tests -- -r test_detr_fbnet_export`

Differential Revision: D66769339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142059
Approved by: https://github.com/ezyang
2024-12-05 18:38:17 +00:00
540dc0c114 [aoti] Prototype loading from bytes (#142070)
The loader needs an official solution -- I'm pretty sure miniz can do this out of the box, but I haven't had time to look at it yet. For now it just loads the buffer into a file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142070
Approved by: https://github.com/henrylhtsang
2024-12-05 18:38:02 +00:00
a5ec09d0cd Flip specialize_float to default False in fbcode (#142111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142111
Approved by: https://github.com/ezyang
2024-12-05 18:23:47 +00:00
7ff42f7f04 Use the correct CSV filenames for MPS benchmark (#142034)
After https://github.com/pytorch/pytorch/pull/135386 and https://github.com/pytorch/pytorch/pull/141999, MPS benchmark has been running for a while and the data has been uploaded correctly.  However, the dashboard is still using the old schema that requires the output CSV files to be named in a certain way for its query to work https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/compilers_benchmark_performance/query.sql#L32-L40.  Specifically, the filename needs to be in the following format `inductor_${backend}_${suite}_${dtype}_${mode}_${device}_${target}.csv`.

The new schema does away with all this hacky setup, but the dashboard hasn't been migrated to it yet. So, this is a quick way to just get the data to show up first.

### Testing

https://github.com/pytorch/pytorch/actions/runs/12153886764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142034
Approved by: https://github.com/skotapati, https://github.com/malfet
2024-12-05 17:53:58 +00:00
cb70d9fd05 Revert "[ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)"
This reverts commit 51b7528e274d350c1d5091acc40572d6b43879b8.

Reverted https://github.com/pytorch/pytorch/pull/141955 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/141955#issuecomment-2521024701))
2024-12-05 17:39:32 +00:00
ae9cda0221 Add truediv support in export serializer (#136364)
Fixes #136113

- [x] Inital `truediv` coverage
- [ ] Expand/reduce coverage?
- [x] Add tests
- [x] Re-check docstrings
- [ ] Linting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364
Approved by: https://github.com/pianpwk

Co-authored-by: Angela Yi <angelayi@meta.com>
Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
2024-12-05 17:33:33 +00:00
07edb2ec4d Update documentation for torch.mean() to note behavior with empty tensors (#142039)
This PR updates the documentation for `torch.mean()` to explicitly mention that computing the mean over an empty tensor returns `nan`. This clarification helps users understand the behavior and handle it appropriately in their code.
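For reference, a quick illustration of the documented behavior:

```python
import torch

print(torch.empty(0).mean())          # tensor(nan)
print(torch.empty(0, 3).mean(dim=0))  # tensor([nan, nan, nan])
```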

Fixes #141057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142039
Approved by: https://github.com/albanD
2024-12-05 17:21:53 +00:00
5bc09ac5e9 Remove option for fork-based compile pool (#142001)
Summary: This has been set to "subproc" for a while internally and externally, so we can remove and simplify some of the code. Note that there's no pressing need here -- just that since we've had an internal outage with the legacy "fork" implementation, it doesn't seem helpful to leave it available. But if people aren't in the mood for this sort of cleanup, I won't be offended to abandon it.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142001
Approved by: https://github.com/eellison, https://github.com/jansel
2024-12-05 17:02:08 +00:00
3cdd997f4c Update torch-xpu-ops commit pin (#142113)
Update the torch-xpu-ops commit to [7ecb0b](7ecb0b1a56), includes:

- Capture rrelu_with_noise noise mutation in compile (Resolves https://github.com/pytorch/pytorch/issues/142102)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142113
Approved by: https://github.com/EikanWang
2024-12-05 17:00:29 +00:00
f8c212a925 Transform unbacked int expressions into a fresh unbacked int. (#141917)
Fix: #141419

This PR introduces the `torch.sym_fresh_size` API, which transforms an unbacked int
expression into a fresh unbacked int.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141917
Approved by: https://github.com/ezyang
2024-12-05 16:53:44 +00:00
c376b29c67 [CI] Add more tests to the numpy 2 CI (#141925)
Related to #107302

This PR adds all the tests that failed with NumPy 2, which all have been fixed, to the CI to test with NumPy 2 to prevent regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141925
Approved by: https://github.com/albanD
2024-12-05 16:46:21 +00:00
822e8a01c6 [ROCm][Inductor][CK] Add batched gemms into gemm max autotune with CK backend (#141520)
## Testing
```
TORCH_LOGS=+torch._inductor pytest --capture=no test/inductor/test_ck_backend.py -k bmm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141520
Approved by: https://github.com/chenyang78
2024-12-05 16:03:12 +00:00
ca9aeedf40 Revert "[dynamo] Simplify handling of functools.wraps (#142000)"
This reverts commit f8cb692d77fb1ab75d6663eb32d71037b82e9107.

Reverted https://github.com/pytorch/pytorch/pull/142000 on behalf of https://github.com/atalman due to Newly added test test_functions.py::DefaultsTests::test_tree_map is failing internally ([comment](https://github.com/pytorch/pytorch/pull/142000#issuecomment-2520611808))
2024-12-05 15:23:53 +00:00
82c140327e Install magma from a tarball (#140417)
Magma is built for specific CUDA versions and stored in the ossci-linux bucket. Install it from there rather than the deprecated conda package.

There are two places where magma is installed today:
- `install_conda.sh`: extract the magma package in the same exact location where conda would install it, using a dedicated `install_magma_conda.sh` script. The new script is included in the relevant Dockerfiles where CUDA+magma is needed
- `install_magma.sh`: this script already uses a tarball. Use the new tarball instead of the tarball from the conda package. The format of the new tarball is compatible with the old one, so changes here are minimal.

Fixes #140538
Test PR: https://github.com/pytorch/pytorch/pull/141584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140417
Approved by: https://github.com/atalman
2024-12-05 15:20:58 +00:00
86d08f0b4a Revert "[dynamo] Remove workaround for functools.wraps in functorch (#142014)"
This reverts commit ed77901ec521f3516c96f9ac2a48e659816c8905.

Reverted https://github.com/pytorch/pytorch/pull/142014 on behalf of https://github.com/atalman due to Sorry https://github.com/pytorch/pytorch/pull/142000 is failing internally, need to revert this ([comment](https://github.com/pytorch/pytorch/pull/142014#issuecomment-2520601186))
2024-12-05 15:18:56 +00:00
d24b147520 Update dead reference link for triplet margin loss (#142071)
The current link for _Learning local feature descriptors with triplets and shallow convolutional neural networks_ (https://www.bmva.org/bmvc/2016/papers/paper119/index.html) is dead (404). The paper is archived here: https://bmva-archive.org.uk/bmvc/2016/papers/paper119/index.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142071
Approved by: https://github.com/albanD
2024-12-05 15:01:10 +00:00
08df79819d Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995)
There's also a new lint to make sure you did it right.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141995
Approved by: https://github.com/albanD, https://github.com/malfet
2024-12-05 14:52:43 +00:00
c6c45467a3 Use cxx11-abi for Linux CUDA 12.6 builds (#142064)
Manylinux 2.28 and cxx11-abi migration. Please see: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142064
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-05 14:51:50 +00:00
2d1d125d60 Add BFloat16 support and use a new pack method for flash attention forward kernel (#138783)
* Add BFloat16 support for BRGEMM flash attention forward kernel
* Use a new pack method instead of oneDNN pack for flash attention forward kernel to avoid the output leading dimension limitation.
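A small sketch that exercises the CPU SDPA path with bfloat16 inputs; whether it actually dispatches to the flash-attention kernel depends on the build and the input sizes:

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 4, 128, 64, dtype=torch.bfloat16) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v)  # CPU path; may route to the flash-attention kernel
print(out.shape, out.dtype)
```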

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138783
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-12-05 14:50:26 +00:00
470b775d7a Remove functorch config: _max_aliased_inputs_with_dynamic_shapes_enabled. (#141680)
This PR removes the functorch config that set an upper limit on the number of aliased
inputs with dynamic shapes. After moving the overlap checks to run at runtime in C++, both
compilation time and runtime (in true-alias cases) improved, rendering the error no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141680
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555, #140013
2024-12-05 14:43:58 +00:00
12d28a5929 Move overlapping guards to C++. (#140013)
This PR moves the logic for computing the overlapping relations between input tensors that
share a storage instance to C++.

In summary, this PR:

- Moves both `tensors_definitely_do_not_overlap` and part of `compute_overlapping_tensors`
to C++
- Introduces a `check_overlapping` function that re-runs `compute_overlapping_tensors`,
checking that the result is consistent with what is expected
- Introduces the `StorageOverlapChecker` class
    - Keeps track of overlapping and non-overlapping tensors
    - Actually checks the overlapping relation (call `check_overlapping`) when all tensors
    are collected
- Introduces the `STORAGE_OVERLAPPING` relational guard
    - Has a reference to a `StorageOverlapChecker`
    - Stores the to-be-checked tensors in the checker, and triggers its check
- Introduces `install_storage_overlapping_guard` python function
    - Creates an instance of `StorageOverlapChecker`
    - Creates 2 instances of the `STORAGE_OVERLAPPING` guard (for overlapping and
    non-overlapping tensors), referencing the same `StorageOverlapChecker` instance

**Why is `StorageOverlapChecker` needed?**

The way `GuardManager` is implemented, we have no control over the order in which the
check methods are called, i.e. no control over the order in which the tensors are collected. So, we
can't easily split them into "overlapping" and "non-overlapping" kinds.

Instead, we create 2 instances of `STORAGE_OVERLAPPING` guard, each of which helps
collecting the tensors for one of the kinds mentioned above. They are then used in a
single `StorageOverlapChecker` instance.
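A minimal sketch of the kind of input pattern these guards must distinguish (illustrative only, not a test from the PR):

```python
import torch

@torch.compile(backend="aot_eager", dynamic=True)
def f(a, b):
    a.add_(1)
    return a + b

x = torch.ones(16)
f(x[:8], x[4:12])  # overlapping views of the same storage
f(x[:8], x[8:])    # non-overlapping views; the STORAGE_OVERLAPPING guards must tell these apart
```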

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140013
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555
2024-12-05 14:43:58 +00:00
3a1ded5caa Add tensor overlapping guards. (#139555)
Fix: #118214

This PR replaces the guards introduced by running `_tensors_definitely_do_not_overlap` at
compile-time by a single `___check_overlapping` guard. When evaluated, this function calls
the original `_tensors_definitely_do_not_overlap` so as to check whether the current state
of the inputs is consistent, i.e. tensors that should overlap do overlap, and those that
shouldn't don't.

In summary, the changes are:

- Introduce `StorageOverlap` derived class from `GuardEnvExpr`
- Plumb `AOTConfig` to the `compute_overlapping_inputs` function, so as to have access to
AOTAutograd input sources
- Suppress the guards generated by `_tensors_definitely_do_not_overlap` function at runtime
- Issue a `StorageOverlap` AOTAutograd guard, specifying the sources that should and
shouldn't overlap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139555
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554
2024-12-05 14:43:58 +00:00
cbfab8b4de Add tensor._base as a tracked fake for ShapeEnv guards. (#139554)
This PR fixes the issue where AOTAutograd would produce a guard that used a symbolic value
that came from the base of one of the inputs.

```python
@torch.compile(backend="aot_eager", dynamic=True)
def f(a, b):
    a.add_(1)
    b.add_(1)
    return a

x = torch.ones(10)
f(x[1:], x[1:])
```

In the example above, AOTAutograd functionalizes the mutation by making use of
`as_strided_scatter` operation, which produces the guard: `s0 >= s1 + 1`, where:

- `s0`: corresponds to `x.size()[0]`
- `s1`: corresponds to `a.size()[0]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139554
Approved by: https://github.com/bdhirsh
2024-12-05 14:43:58 +00:00
27bf7d62e7 Enable retry on A100 perf nightly (#142074)
This is a quick mitigation while the investigation in https://github.com/pytorch/pytorch/issues/142069 is ongoing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142074
Approved by: https://github.com/jeanschmidt
2024-12-05 14:35:53 +00:00
6183c90e99 Avoid recursion in FloorDiv constructor (#142057)
Addresses https://github.com/pytorch/pytorch/issues/141215 and a max-recursion issue in the FloorDiv constructor.
This also optimizes performance by avoiding the construction of many sympy expressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142057
Approved by: https://github.com/ezyang
2024-12-05 14:25:28 +00:00
895c8ce5b3 MetaTensorDesc changes for reconstructing proper FakeTensors (#141926)
A few changes to MetaTensorDesc and friends:

1. Change view_func from a raw method to an ADT where the common case (FakeTensor._view_func_unsafe) is a simple representation instead.
2. (minor) Remove and fix some `type: ignore`s added by #141839
3. (minor) Fix _UNSERIALIZABLE to be a set instead of a dict which is converted into a set each time it's used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141926
Approved by: https://github.com/ezyang
2024-12-05 14:21:57 +00:00
65c2086d45 fix the lint from D66795414 (#142122)
Summary: this diff is to fix the lint issues from D66457500 / https://github.com/pytorch/pytorch/pull/142056

Test Plan: OSS CI

Reviewed By: houseroad, FulinHuang

Differential Revision: D66795414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142122
Approved by: https://github.com/houseroad
2024-12-05 12:05:51 +00:00
38e0f72274 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-05 11:25:55 +00:00
ad2cc96218 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-12-05 09:57:08 +00:00
8dd4673cea Support torch.xpu.mem_get_info API (#141230)
# Motivation
Fixes https://github.com/pytorch/pytorch/issues/130599
This PR intends to add a new API, `torch.xpu.mem_get_info`, which is widely used in popular model workloads.
For example, [here](403c0714d1/src/accelerate/utils/modeling.py (L721)) we need to get the current GPU memory usage to split or load the model.
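A hedged usage sketch, assuming the new API mirrors `torch.cuda.mem_get_info`'s `(free, total)` return in bytes:

```python
import torch

if torch.xpu.is_available():
    free_bytes, total_bytes = torch.xpu.mem_get_info()  # assumed (free, total) in bytes
    print(f"free: {free_bytes / 1e9:.2f} GB, total: {total_bytes / 1e9:.2f} GB")
```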

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-05 08:17:25 +00:00
0be004ff37 Enable fuse_by_partitions to always return output as tuple (#142056)
Summary:
aot_compile only accepts a graph with a tuple output.
We introduce an option to fuse_by_partitions to always return outputs as a tuple, even if there is only a single entry.

Test Plan: OSS CI

Differential Revision: D66457500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142056
Approved by: https://github.com/angelayi, https://github.com/hl475
2024-12-05 08:07:41 +00:00
f675f644fd Cleanup between each test in test/test_utils_config_module.py (#142087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142087
Approved by: https://github.com/ezyang
2024-12-05 07:17:27 +00:00
cyy
aa95618268 [2/N] Apply py39 ruff fixes (#141938)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141938
Approved by: https://github.com/ezyang
2024-12-05 06:26:06 +00:00
cyy
653efe14e4 [3/N] Enable UBSAN tests (#142022)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142022
Approved by: https://github.com/ezyang
2024-12-05 06:06:53 +00:00
b31d3b2f41 Update torch-xpu-ops commit pin (#141949)
Update the torch-xpu-ops commit to [f31219](f312190a92), includes:

- Add lazy init for empty_xpu
- Fix nan propagation error for soft_shrink

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141949
Approved by: https://github.com/EikanWang
2024-12-05 05:22:38 +00:00
31f2d4eb4e [export] Update docs (#142011)
Summary:
Update export docs. Including:
1. Update the output graph.
2. Misc fixes for examples.

Test Plan: CI

Differential Revision: D66726729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142011
Approved by: https://github.com/angelayi
2024-12-05 03:44:46 +00:00
471017cbc9 avoid specializing strides with DDPOptimizer + inductor (#140751)
Fixes https://github.com/pytorch/pytorch/issues/140229

Fixes https://github.com/pytorch/pytorch/issues/139474

The issue was that:

(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket

(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with

(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with

(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which, given the sympy string and the ShapeEnv's `var_to_val` mapping, evaluates the sympy string to generate concrete strides

(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInts` that we can return.
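A plain-sympy illustration of the specialization problem described in (5) (not the actual inductor helper):

```python
import sympy

stride_expr = sympy.sympify("s0*s1")                # an output stride recorded as a string
specialized = stride_expr.subs({"s0": 4, "s1": 3})  # -> 12: bakes in the current input sizes
print(specialized, stride_expr)                     # DDPOptimizer needs the symbolic form instead
```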

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140751
Approved by: https://github.com/eellison
2024-12-05 03:41:12 +00:00
b08bc07cd7 [AOTInductor] Option to not include weight in .so (#141997)
Summary: Add an option in config to not include weights in .so

Test Plan: `test/inductor:test_aot_inductor -- -r test_so_without_weight_cuda`

Reviewed By: desertfire

Differential Revision: D65968885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141997
Approved by: https://github.com/desertfire
2024-12-05 03:35:54 +00:00
51cbac4e6a [export] Change fx graph _replace_hook to a list of Callable (#142006)
Summary: Change fx graph module's _replace_hook from a single hook, to a list of hooks. This is to prepare to registering more hooks for inductor provenance tracking, where we might need to register multiple hooks for node replacement.

Test Plan:
```
buck run mode/dev-nosan caffe2/test:fx -- -r test_hooks_for_node_update
buck run mode/dev-nosan caffe2/test:test_export -- -r test_replace_hook
```

Differential Revision: D66726724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142006
Approved by: https://github.com/zhxchen17
2024-12-05 03:26:48 +00:00
45583a5df9 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
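A minimal usage sketch of the newly public API (requires an initialized process group, e.g. under torchrun; not taken from the PR's tests):

```python
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
for layer in model:
    fully_shard(layer)  # shard each submodule first
fully_shard(model)      # then the root module
```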

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
f9af86de01 [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.
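For illustration, the shape of the change (values taken from the example above):

```python
numels_old = (8, 16)            # positional: (xnumel, rnumel)
numels_new = {"x": 8, "r": 16}  # keyed by tree prefix; generalizes to multi-dimensional reductions
```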

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-05 02:28:16 +00:00
ff0cfec4c0 AsyncCollectiveTensor: fix _are_we_tracing() in dynamo (#142075)
Fixes https://github.com/pytorch/pytorch/issues/142076. Under compile, functional collectives are supposed to **not** return `AsyncCollectiveTensor`, and instead immediately issue calls to `wait_tensor()` (which we rely on the compiler to reorder as necessary).

This is done with a function `_are_we_tracing()`, that tries to detect if we are running from inside of the compiler. One of the checks it performs is `is_torchdynamo_compiling()` ([here](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L808C8-L808C34)).

Unfortunately, this will always return False, even if dynamo is indeed tracing. The problem is that this function only returns true if dynamo **intercepts** the bytecode for `is_torchdynamo_compiling()`. However, this function is called during fake-tensor propagation, which is run as part of dynamo, but is not actually intercepted by dynamo itself.

One thing that we know is the case during dynamo tracing, however, is that a `FakeTensorMode` is active. So I tweaked the logic to assume that we are tracing if there is an active fake mode.

This could potentially have consequences for anybody running functional collectives with a fake mode directly, without compile in the loop. Although hopefully it's not too unreasonable to issue wait() calls immediately if you are running with fake tensor (presumably you only care about fake tensor propagation, in which case the wait() calls should technically be a no-op).
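A hedged sketch of the detection logic described above; the helpers below are private/illustrative and the real implementation may differ:

```python
import torch
from torch._guards import detect_fake_mode  # private helper; used here for illustration only

def are_we_tracing_sketch() -> bool:
    if torch.compiler.is_dynamo_compiling():
        return True
    # Dynamo's fake propagation is not intercepted by dynamo itself, but it does run
    # under an active FakeTensorMode, so treat that as "tracing" too.
    return detect_fake_mode() is not None
```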

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142075
Approved by: https://github.com/yifuwang, https://github.com/kwen2501
ghstack dependencies: #141725, #141728
2024-12-05 02:01:18 +00:00
dbd7b820dd Revert "[ROCm] port CK rowwise F8 from fbgemm (#140856)"
This reverts commit 291626fb22832f9381524be73241b495efa60532.

Reverted https://github.com/pytorch/pytorch/pull/140856 on behalf of https://github.com/atalman due to Failing internal build ([comment](https://github.com/pytorch/pytorch/pull/140856#issuecomment-2518911997))
2024-12-05 01:51:40 +00:00
7e77c5ffba cpp_wrapper: input kwargs to custom ops (#141370)
Fixes a situation where kwargs were being passed to a Python fallback op, but as args rather than kwargs. This does not work for arguments that are kwarg-only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141370
Approved by: https://github.com/desertfire
ghstack dependencies: #141368, #141580, #141369
2024-12-05 00:58:01 +00:00
dd7debdbe8 cpp_wrapper: rethrow Python exceptions, when present (#141369)
When running fallback operations in `cpp_wrapper` mode, Python errors thrown in the fallback should be propagated up the stack. This PR fixes the current situation, which discards all Python errors thrown in the fallback op in favor of an uninformative `RuntimeError`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141369
Approved by: https://github.com/desertfire
ghstack dependencies: #141368, #141580
2024-12-05 00:58:01 +00:00
4613bd393d cpp_wrapper: Add support for torch.device arguments (#141580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141580
Approved by: https://github.com/desertfire
ghstack dependencies: #141368
2024-12-05 00:58:01 +00:00
923a778f97 cpp_wrapper: Complete support for Layout arguments (#141368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141368
Approved by: https://github.com/desertfire
2024-12-05 00:58:01 +00:00
3fdc74ae29 Fix dumb typo (#142079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142079
Approved by: https://github.com/jainapurva, https://github.com/soulitzer
2024-12-05 00:43:49 +00:00
60a192036b Refactor optional graph module into CompiledFxGraphConstants (#141897)
FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things:

- It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this.

- It refactors the code to be way more clear about the constants we're using and when we're using them.

Basically, there are two possible sets of constants we can grab from the compiled fx graph.

1. If freezing is turned off, we save the constants directly in CompiledFxGraph.
2. If freezing is turned on, we save the *names* of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch.

We implement two different classes for doing just this: one that has access to the post aotdispatch gm, which supports freezing, and one that doesn't have it, which does not support freezing. Then we construct the wrappers and unwrap the result as needed.

This makes it clear that the gm passed to AOTAutogradCache is *not* part of post compile, only the cache key generated from it is.

The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing.

Will add a specific AOTAutogradCache test to confirm we bypass freezing.
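A hedged sketch of the split described above; the class and field names are illustrative, not the ones in the PR:

```python
from dataclasses import dataclass
from typing import Dict

import torch

@dataclass
class ConstantsFromCompiledGraph:
    """Freezing off: constant values are saved directly with the cached graph."""
    constants: Dict[str, torch.Tensor]

    def resolve(self, _runtime_gm) -> Dict[str, torch.Tensor]:
        return self.constants

@dataclass
class ConstantsFromRuntimeModule:
    """Freezing on: only names are saved; values come from the AOTDispatch graph module."""
    constant_attrs: Dict[str, str]  # cached-graph name -> attribute on the runtime gm

    def resolve(self, runtime_gm) -> Dict[str, torch.Tensor]:
        return {name: getattr(runtime_gm, attr) for name, attr in self.constant_attrs.items()}
```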

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897
Approved by: https://github.com/ezyang, https://github.com/masnesral
2024-12-05 00:34:14 +00:00
25d9fa84ea [CI, 3.13] enable dynamo_wrapped unittests in 3.13 (#141264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141264
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887, #141950, #141951
2024-12-05 00:33:26 +00:00
797a347cd0 [ci, 3.13] disable segfaulting dynamo-wrapped profiler test (#141951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141951
Approved by: https://github.com/sraikund16, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887, #141950
2024-12-05 00:33:26 +00:00
ae71240780 [ci, 3.13] fix/skip failing numpy 2.0+ dynamo-wrapped tests (#141950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141950
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887
2024-12-05 00:33:26 +00:00
fbd130a41f [ci, 3.13] skip failing module tracker dynamo-wrapped test (#141887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141887
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886
2024-12-05 00:33:26 +00:00
9e474231d7 [ci, 3.13] skip failing torch.package dynamo-wrapped test (#141886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141886
Approved by: https://github.com/PaliC
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860
2024-12-05 00:33:26 +00:00
408669a559 [dynamo, 3.13] disable 3.13.0 warning in dynamo-wrapped tests (#141860)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141860
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859
2024-12-05 00:33:26 +00:00
d34235a2a3 [dynamo, 3.13] add JUMP_BACKWARD_NO_INTERRUPT to terminal opcodes (#141859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141859
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733
2024-12-05 00:33:26 +00:00
2f45484331 [ci] add 3.13 inductor unittests to CI (#140733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140733
Approved by: https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533
2024-12-05 00:33:26 +00:00
3baf8859e6 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [4/N] (#140253)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140253
Approved by: https://github.com/soulitzer
2024-12-05 00:30:00 +00:00
416f500bfe [CI, 3.13] enable 3.13 CI (#139533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139533
Approved by: https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862
2024-12-05 00:25:03 +00:00
abc4111348 [ci, 3.13] skip dynamo-xpass'd numpy tests in numpy >= 2.0 (#141862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141862
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858
2024-12-05 00:25:02 +00:00
76d1047629 [dynamo, 3.13] support CONVERT_VALUE (#141858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141858
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674
2024-12-05 00:24:55 +00:00
40c959484c [ci, 3.13] disable segfaulting profiler tests in 3.13 (#141674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141674
Approved by: https://github.com/sraikund16, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673
2024-12-05 00:24:48 +00:00
c946d82077 [ci, 3.13] disable another failing cpp_extension test in 3.13 (#141673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141673
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623
2024-12-05 00:24:42 +00:00
cd56cd30f2 [ci, 3.13] disable failing cpp_extension test due to weights_only error in numpy 2.1 (#141623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141623
Approved by: https://github.com/mikaylagawarecki, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621
2024-12-05 00:24:35 +00:00
2be8d16247 [ci, 3.13] disable some quantization tests affected by numpy 2.1 overflow error (#141621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141621
Approved by: https://github.com/jerryzh168, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605
2024-12-05 00:24:29 +00:00
314e5dd1d1 [ci, 3.13] skip some parts of a failing jit test in 3.13 (#141605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141605
Approved by: https://github.com/davidberard98, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577
2024-12-05 00:24:22 +00:00
1a44f01beb [ci, 3.13] update test_testing.py usage of locals() for 3.13 (#141577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141577
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572
2024-12-05 00:24:14 +00:00
9459952175 [ci, 3.13] update tensorboard version for 3.13 to fix broken tests (#141572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141572
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003
2024-12-05 00:24:07 +00:00
c93dd531d3 format test_monitor.py and test_tensorboard.py (#142003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142003
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409
2024-12-05 00:23:54 +00:00
22ae34af88 [torch.package, 3.13] fixes to torch.package for 3.13 (#141409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141409
Approved by: https://github.com/PaliC, https://github.com/atalman
2024-12-05 00:23:47 +00:00
e6e75ebd0a Silent TD warnings when there is no td_results.json (#142083)
Despite the fact that we have `continue-on-error: true` there, GH behaves noisily when `td_results.json` doesn't exist.
For example, all benchmark jobs in https://github.com/pytorch/pytorch/actions/runs/12149624686 finished successfully but they all showed up as errors on the GH UI.  To make this worse, the log classifier sometimes picks up the error https://github.com/pytorch/pytorch/actions/runs/12149624686/job/33882285001#step:16:37
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142083
Approved by: https://github.com/clee2000
2024-12-04 23:43:29 +00:00
UV
0318589e87 Changed 'standard-deviation' to 'variance' in GroupNorm documentation (#141982)
Fixes #141315

Updated the GroupNorm documentation to replace 'standard-deviation' with 'variance' to accurately reflect the calculation
method.

@pytorchbot label "topic: not user facing"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141982
Approved by: https://github.com/mikaylagawarecki
2024-12-04 22:49:45 +00:00
326f487809 Bypass AutogradCache when view replay affects the mutation meta (#141978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141978
Approved by: https://github.com/bdhirsh
2024-12-04 22:13:12 +00:00
fa2fe9cafb Delete linux-focal-cuda12.4-py3.10-gcc9-sm86 from trunk (#142073)
As it exactly mirrors the job in pull.yml, see
c83b739f14/.github/workflows/pull.yml (L479-L495)

And also HUD [permalink](53768d67ab/3):

<img width="1201" alt="image" src="https://github.com/user-attachments/assets/d3f0f81c-843b-4f96-82ce-9fd18ebfe2ad">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142073
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman
2024-12-04 22:04:55 +00:00
69f8b3e269 [ROCm] unskip hermite_polynomial_h unit tests (#141150)
A large n input caused a regression starting in ROCm 6.1: the for loop runs for an excessive number of iterations. The root cause seems to be how static_cast<int64_t> behaves for large float values such as 1e20, which certain unit tests use. The workaround is to break out of the loop once the returned value reaches NaN.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141150
Approved by: https://github.com/eqy, https://github.com/malfet
2024-12-04 22:01:57 +00:00
53768d67ab Fix unit test failures with SciPy 1.13+ (#141986)
Related to #107302

To use `numpy>=2`, we need to upgrade `scipy` from `1.11.0` to `>=1.13.0`.
This PR fixes a failed test caused by the `scipy` upgrade.

The `scipy` implementation of `logsumexp` has changed and deviated from the torch implementation.
So, we replace it with a simple custom implementation as the ground truth.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141986
Approved by: https://github.com/rgommers, https://github.com/albanD
2024-12-04 21:41:38 +00:00
54324fc2d9 [MPS] Release MetalShaderLibrary cached resources (#142053)
By releasing retained `id<MTLFunction>` and `id<MTLComputePipelineState>`
Please note that the `id<MTLLibrary>` associated with the class is currently leaked, which is by design; all dynamic shader allocations should use `DynamicMetalShaderLibrary`

Test plan: `leaks --atExit -- ./bin/mps_test_metal_library`

Before:
```
STACK OF 1 INSTANCE OF 'ROOT LEAK: <_MTLFunctionInternal>':
18  dyld                                  0x197a94274 start + 2840
17  mps_test_metal_library                0x1002cb420 main + 68
16  mps_test_metal_library                0x1002fa388 testing::UnitTest::Run() + 124
15  mps_test_metal_library                0x1002fa40c bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 80
14  mps_test_metal_library                0x1002fac50 testing::internal::UnitTestImpl::RunAllTests() + 1588
13  mps_test_metal_library                0x1002e9934 testing::TestSuite::Run() + 1032
12  mps_test_metal_library                0x1002e8688 testing::TestInfo::Run() + 960
11  mps_test_metal_library                0x1002e715c testing::Test::Run() + 812
10  mps_test_metal_library                0x1002e7200 void testing::internal::HandleExceptionsInMethodIfSupported<testing::TestSuite, void>(testing::TestSuite*, void (testing::TestSuite::*)(), char const*) + 80
9   mps_test_metal_library                0x1002c5518 MPSTestMetalLibrary_ArangeShader_Test::TestBody() + 420
8   libtorch_cpu.dylib                    0x10fdd3804 at::native::mps::MetalShaderLibrary::getKernelFunction(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 56
7   libtorch_cpu.dylib                    0x10fdd3394 at::native::mps::MetalShaderLibrary::getLibraryPipelineState(id<MTLLibrary>, std::__1::basic_string<char, id<MTLLibrary>::char_traits<char>, id<MTLLibrary>::allocator<char>> const&) + 268
6   com.apple.Metal                       0x1a2be43b4 -[_MTLLibrary newFunctionWithName:] + 28
5   com.apple.Metal                       0x1a2be4498 -[_MTLLibrary newFunctionWithNameInternal:] + 148
4   com.apple.Metal                       0x1a2be4580 MTLLibraryContainer::functionWithName(NSString*, id<MTLDevice>) + 68
3   com.apple.Metal                       0x1a2be4724 MTLLibraryDataWithArchive::newFunction(NSString*, id<MTLDevice>) + 368
2   libobjc.A.dylib                       0x197a49ddc _objc_rootAllocWithZone + 48
1   libsystem_malloc.dylib                0x197c3baf8 _calloc + 88
0   libsystem_malloc.dylib                0x197c4e9bc _malloc_zone_calloc_instrumented_or_legacy + 128
====
    2 (592 bytes) ROOT LEAK: <_MTLFunctionInternal 0x1325e5550> [448]
       1 (144 bytes) _functionQueue --> <dispatch_queue_t (serial) 0x13254c340> [144]  "function queue" (from Metal)
```
After:
```
Process:         mps_test_metal_library [30687]
Path:            /Users/USER/*/mps_test_metal_library
Load Address:    0x100f74000
Identifier:      mps_test_metal_library
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [30686]

Date/Time:       2024-12-04 07:57:01.020 -0800
Launch Time:     2024-12-04 07:56:59.030 -0800
OS Version:      macOS 15.1.1 (24B2091)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         177.2M
Physical footprint (peak):  236.5M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 30687: 40691 nodes malloced for 5575 KB
Process 30687: 0 leaks for 0 total leaked bytes.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142053
Approved by: https://github.com/manuelcandales
ghstack dependencies: #142052
2024-12-04 21:40:50 +00:00
e8200a507d [MPS] Fix memory leak (#142052)
`NSProcessInfo` was allocated inside an autorelease pool but was not added to the pool

Test plan: `leaks --atExit -- ./bin/mps_test_print`

Before it reported the leaks as follows
```
leaks Report Version: 4.0, multi-line stacks
Process 30066: 39595 nodes malloced for 5034 KB
Process 30066: 7 leaks for 448 total leaked bytes.

STACK OF 1 INSTANCE OF 'ROOT LEAK: <NSProcessInfo>':
29  dyld                                  0x197a94274 start + 2840
28  mps_test_print                        0x10224440c main + 68
27  mps_test_print                        0x1022733e4 testing::UnitTest::Run() + 124
26  mps_test_print                        0x102273468 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 80
25  mps_test_print                        0x102273cac testing::internal::UnitTestImpl::RunAllTests() + 1588
24  mps_test_print                        0x102262990 testing::TestSuite::Run() + 1032
23  mps_test_print                        0x1022616e4 testing::TestInfo::Run() + 960
22  mps_test_print                        0x1022601b8 testing::Test::Run() + 812
21  mps_test_print                        0x10226025c void testing::internal::HandleExceptionsInMethodIfSupported<testing::TestSuite, void>(testing::TestSuite*, void (testing::TestSuite::*)(), char const*) + 80
20  mps_test_print                        0x102240f88 MPSPrintTest_PrintFloatMatrix_Test::TestBody() + 88
19  mps_test_print                        0x1022414f4 torch::randn(c10::ArrayRef<long long>, c10::TensorOptions) + 72
18  libtorch_cpu.dylib                    0x10de1cb34 at::_ops::randn::call(c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 280
17  libtorch_cpu.dylib                    0x10de1cf1c at::_ops::randn::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 152
16  libtorch_cpu.dylib                    0x10d9b1078 at::native::randn(c10::ArrayRef<long long>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 60
15  libtorch_cpu.dylib                    0x10d9b1220 at::native::randn(c10::ArrayRef<long long>, std::__1::optional<at::Generator>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 256
14  libtorch_cpu.dylib                    0x10e0151f8 at::_ops::normal_::call(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 476
13  libtorch_cpu.dylib                    0x10f08ceac c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (at::Tensor&, double, double, std::__1::optional<at::Generator>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_MPS__normal_(at::Tensor&, double, double, std::__1::optional<at::Generator>)>, at::Tensor&, c10::guts::typelist::typelist<at::Tensor&, double, double, std::__1::optional<at::Generator>>>, at::Tensor& (at::Tensor&, double, double, std::__1::optional<at::Generator>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, double, double, std::__1::optional<at::Generator>) + 84
12  libtorch_cpu.dylib                    0x10f037674 at::(anonymous namespace)::(anonymous namespace)::wrapper_MPS__normal_(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 72
11  libtorch_cpu.dylib                    0x111d8bde8 at::native::normal_mps_(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 132
10  libtorch_cpu.dylib                    0x111d8c334 at::native::mps::normal_mps_impl(at::Tensor&, double, double, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Generator>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 884
9   libtorch_cpu.dylib                    0x111d8b8d8 at::Tensor& at::native::mps::random_mps_impl<double>(at::Tensor&, double, double, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Tensor> const&, MPSGraphRandomDistribution, std::__1::optional<at::Generator>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, MPSGraphTensor* (at::native::mps::RandomCachedGraph*, MPSGraphTensor*) block_pointer) + 2508
8   libtorch_cpu.dylib                    0x111d453bc at::native::mps::Placeholder::Placeholder(MPSGraphTensor*, at::Tensor const&, NSArray<NSNumber*>*, bool, MPSDataType, bool) + 5120
7   libtorch_cpu.dylib                    0x111d2dbc8 at::mps::MPSDevice::isMacOS13Plus(at::mps::MacOSVersion) const + 404
6   libtorch_cpu.dylib                    0x111d2ddf0 at::mps::MPSDevice::isMacOS13Plus(at::mps::MacOSVersion) const::$_0::operator()(int, int) const + 48
5   libobjc.A.dylib                       0x197a7b3f4 objc_alloc_init + 80
4   com.apple.Foundation                  0x19995fbe4 +[NSProcessInfo alloc] + 112
3   com.apple.Foundation                  0x19995faec +[NSProcessInfo allocWithZone:] + 120
2   libobjc.A.dylib                       0x197a49ddc _objc_rootAllocWithZone + 48
1   libsystem_malloc.dylib                0x197c3baf8 _calloc + 88
0   libsystem_malloc.dylib                0x197c4e9bc _malloc_zone_calloc_instrumented_or_legacy + 128
====
    1 (64 bytes) ROOT LEAK: <NSProcessInfo 0x102ce4de0> [64]
```
After test run finishes with no leaks reported
```
Process 29875 is not debuggable. Due to security restrictions, leaks can only show or save contents of readonly memory of restricted processes.

Process:         mps_test_print [29875]
Path:            /Users/USER/*/mps_test_print
Load Address:    0x10223c000
Identifier:      mps_test_print
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [29874]

Date/Time:       2024-12-04 07:43:15.287 -0800
Launch Time:     2024-12-04 07:43:14.400 -0800
OS Version:      macOS 15.1.1 (24B2091)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         172.0M
Physical footprint (peak):  234.1M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 29875: 39508 nodes malloced for 5021 KB
Process 29875: 0 leaks for 0 total leaked bytes.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142052
Approved by: https://github.com/manuelcandales
2024-12-04 21:40:50 +00:00
c0c8f41679 [ROCm] add gfx1101 to wheels (#141667)
- Remove older ROCm 5.x build condition

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141667
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-04 21:21:29 +00:00
760b8ec10a [easy] Log bypass reasons if we're unable to serialize or deserialize a saved graph (#141911)
When we fail to deserialize/serialize a graph, we should alert and log it somewhere so that it's debuggable.

This can happen in OSS if we use view_replay and encounter an output that requires functional tensor to be serialized to work.

Differential Revision: [D66669993](https://our.internmc.facebook.com/intern/diff/D66669993/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141911
Approved by: https://github.com/oulgen, https://github.com/ezyang
2024-12-04 21:03:32 +00:00
f24a9d0755 [PGNCCL] Fix behavior of destroy_process_group (#141510)
Today `destroy_process_group()` is implemented via `ncclCommAbort`.
When the user calls it from the CPU side, the risk is that a healthy NCCL kernel gets preempted, which causes data corruption.

Instead of aborting kernels, we should flush collectives in `destroy_process_group`, i.e. let them complete normally, before we tear down resources.

This PR implements such "flushing" behavior using `ncclCommFinalize`, then reclaims resources via `ncclCommDestroy`.

Expected behaviors:
For a bad program, a hang is expected at `destroy_process_group()`. If the PG uses non-blocking communicators, such a hang is recoverable, because we attach a timeout to the flush behavior.
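A hedged usage sketch of the intended flow (run under torchrun with NCCL available; not code from the PR):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
t = torch.ones(8, device=device)
dist.all_reduce(t)            # an in-flight collective
dist.destroy_process_group()  # now flushes pending work (ncclCommFinalize) before destroying
```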

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141510
Approved by: https://github.com/wconstab
2024-12-04 20:30:47 +00:00
f7bd0c6b60 [doc] Fix the toctree level (#142008)
Changing this back to 1 so that it doesn't expand on the index.html page.
Before:
![Screenshot 2024-12-04 at 11 47 54 AM (2)](https://github.com/user-attachments/assets/40d730ee-61b9-4d60-ab13-9b9075cb3cba)
After:
![Screenshot 2024-12-04 at 11 48 30 AM (2)](https://github.com/user-attachments/assets/5eb711a0-e76c-4573-9fdf-88b6b94b31a9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142008
Approved by: https://github.com/sekyondaMeta, https://github.com/malfet
2024-12-04 19:52:14 +00:00
d552625920 [xla] Update pin to current xla/master (#142065)
The previous xla pin update pointed to a branch on xla, which I then force-pushed to remove .torch_pin from the xla PR, so that commit became unavailable. Updating to the merged xla/master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142065
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-12-04 19:44:50 +00:00
ed77901ec5 [dynamo] Remove workaround for functools.wraps in functorch (#142014)
This is no longer needed after #142000.

Fixes #123365.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142014
Approved by: https://github.com/zou3519
ghstack dependencies: #142000
2024-12-04 19:10:46 +00:00
f8cb692d77 [dynamo] Simplify handling of functools.wraps (#142000)
Previously, when Dynamo encountered a `functools.wraps(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break if that check failed.

That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
   reconstructability isn't necessarily encapsulated in
   `VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
   it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
   #134731.
2. We don't always need to reconstruct the result of
   `functools.wraps(...)`; for those cases we don't want to give up
   tracing because of an early `can_reconstruct` check. Instead we could just
   let it fall through, and graph break in the actual `reconstruct` call
   later, if needed.

This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.

Fixes #134731, #141514.
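For reference, a minimal sketch of the kind of pattern this code path handles (the decorator and function names are illustrative, not taken from the test suite):
```python
import functools
import torch

def my_decorator(fn):
    @functools.wraps(fn)  # previously this could trigger an eager can_reconstruct check
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper

@torch.compile
def f(x):
    @my_decorator
    def inner(y):
        return y.sin()
    return inner(x)

print(f(torch.randn(3)))
```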

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
2024-12-04 19:10:45 +00:00
51b7528e27 [ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to return NaN early when it is known that the result at this value will be NaN.

According to my short test
```Python
import torch
device = "cuda"
dtype = torch.float32

x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)

for n in range(1024):
    if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
        break

for n in range(1024):
    if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
        break
```

The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```

Surely, it makes sense to increase the limit as a safety margin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet
2024-12-04 18:21:44 +00:00
c47dae8646 [functional autograd] refactor CopyBackward to be functional (#141719)
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141719
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278, #141348
2024-12-04 18:06:31 +00:00
215f5d77b5 [functional autograd] Refactor validate_outputs into a functional variant (#141348)
Today, validate_outputs is stateful (it depends on the autograd graph).
This PR refactors it into a stateless form that just depends on
InputMetadata.

Test Plan:
- new unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141348
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278
2024-12-04 18:06:31 +00:00
2b4f1f4990 [functional autograd] Refactor built-in autograd nodes into functional variants (#141278)
This PR refactors all builtin autograd nodes (e.g. MulBackward0) from
having a single MulBackward0::apply into having:
- a "pure function variant" `MulBackward0_apply_functional`
- a stateful variant MulBackward0::apply that ends up calling
  `MulBackward0_apply_functional`.

In order to do this we left the stateful pieces in MulBackward0::apply
(like unpacking of saved vars, determining which gradients actually need
computing).
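
Conceptually, the split looks roughly like the following Python sketch (the real code is generated C++; the names below other than MulBackward0 are illustrative):
```python
# Pure function of the incoming gradient and the saved state.
def mul_backward_apply_functional(grad_output, saved_self, saved_other, needs_input_grad):
    grad_self = grad_output * saved_other if needs_input_grad[0] else None
    grad_other = grad_output * saved_self if needs_input_grad[1] else None
    return grad_self, grad_other

class MulBackward0:
    def __init__(self, saved_self, saved_other, needs_input_grad):
        self.saved_self = saved_self
        self.saved_other = saved_other
        self.needs_input_grad = needs_input_grad

    def apply(self, grad_output):
        # Stateful part: unpack saved variables and decide which gradients
        # need computing, then delegate the math to the pure variant.
        return mul_backward_apply_functional(
            grad_output, self.saved_self, self.saved_other, self.needs_input_grad
        )
```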

The motivation is that this will be useful for compiled autograd in a
future PR. We might refactor this more later, but I wanted to get
something reviewed, shipped, and tested in-tree because the entire stack
is going to be big and this change by itself might have subtle perf issues.

The new codegen looks like the following:
- https://gist.github.com/zou3519/84721cfbef71bb640ddf1a64ef8583a3

Here's the old codegen for comparison:
- https://gist.github.com/zou3519/73f925fe6aca6dd3ceb0a6e6fcf5f77d

Test Plan:
- existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141278
Approved by: https://github.com/soulitzer
2024-12-04 18:06:31 +00:00
fd35be2fd3 TritonTemplate dtype fixes (#141991)
- Set the dtype of "acc" appropriately so that epilogue fusion will have args with dtype
- Update dtype propagation to use `type_to_dtype` instead of instantiating tensor
- Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph thing which is nyi. everything else is appropriately typed (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @drisspg ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991
Approved by: https://github.com/drisspg
ghstack dependencies: #139945, #140057, #141495, #141882
2024-12-04 17:24:23 +00:00
920e4364b7 [BE] Remove "$PACKAGE_TYPE" == 'conda' logic from build scripts (#142019)
Please see: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142019
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-04 16:05:43 +00:00
0582b32f6c Enable Extension Support (#142028)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142028
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-12-04 15:54:06 +00:00
38d10a1b17 Revert "[Inductor] Represent tiling as a dict (#141751)"
This reverts commit 5deca07c0dcf1482eba99bf93b805cf1cc41ad6c.

Reverted https://github.com/pytorch/pytorch/pull/141751 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/141751#issuecomment-2517815899))
2024-12-04 15:43:16 +00:00
7830c213d7 [FlexAttention] Fix max-autotune bug with captured buffer grads (#141531)
# Summary
Fix tensor argument ordering for autotuning flex attention, and change how we enable scatter codegen for Triton. We used to go through the existing store_output Triton codegen, but now we just short-circuit and generate the correct expression earlier on.

This lets us reuse the exact same mutated-buffer infra we used for multiple outputs before, instead of relying on arg.python_defs to thread arguments through via input_buffers.

Test cases added for both default and max-autotune-no-cudagraphs modes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141531
Approved by: https://github.com/Chillee
2024-12-04 14:56:58 +00:00
6ad422d778 set_linter finds and replaces built-in set in Python code (#138454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138454
Approved by: https://github.com/eellison
2024-12-04 14:31:24 +00:00
7666c8263a [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

Co-authored-by: James Wu <jjwu@meta.com>
2024-12-04 14:21:04 +00:00
f85e238186 [aotd] capture rrelu_with_noise noise mutation in compile (#141867)
Rebase-copy of the long-standing, already-approved PR https://github.com/pytorch/pytorch/pull/138503 that was blocked from landing by xla build issues.

Opened a new PR with the same content (the ghstack checkout was failing due to changed submodules).

Corresponding xla PR:
https://github.com/pytorch/xla/pull/8363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141867
Approved by: https://github.com/bdhirsh
2024-12-04 12:18:58 +00:00
61dc5e9c0a Enforce contiguity for alltoall (#141816)
Summary: We cannot relax the alltoall contiguous requirement which will lead to wrong results.

Test Plan: Added a test.

Differential Revision: D66560930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141816
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/fduwjj, https://github.com/fegin, https://github.com/yoyoyocmu
2024-12-04 10:17:39 +00:00
eff99a4b4b fix linalg.SVD docs typo: wrong V* shape in reduced SVD (#142037)
https://en.wikipedia.org/wiki/Singular_value_decomposition#Reduced_SVDs

In the reduced SVD of an m×n matrix, V has shape (n, k) and V* has shape (k, n), where k = min(m, n).

The docs had the V* shape wrong.
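
A quick sanity check with one illustrative shape (m = 3, n = 5):
```python
import torch

A = torch.randn(3, 5)                                # m = 3, n = 5, k = min(m, n) = 3
U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # reduced SVD
print(U.shape, S.shape, Vh.shape)                    # (3, 3), (3,), (3, 5): Vh (= V*) is (k, n)
print(Vh.mH.shape)                                   # (5, 3): V is (n, k)
```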

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142037
Approved by: https://github.com/lezcano
2024-12-04 09:18:33 +00:00
16676fd17b Disable unused ARM SME to reduce android app binary size (#141942)
Summary: ARM SME kernels aren't currently used, so disable their build to reduce the Android app binary size.

Reviewed By: digantdesai

Differential Revision: D66336599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141942
Approved by: https://github.com/digantdesai
2024-12-04 07:24:50 +00:00
9dffd12f90 Upgrade ROCm wheels to manylinux2_28 - 2 of 2 (binaries) (#141423)
Depends on https://github.com/pytorch/pytorch/pull/140681 and https://github.com/pytorch/pytorch/pull/141609

Highlights:
* Upgrade binaries to ROCm6.2.4 to use latest docker images
* Remove pre-cxx11 builds for libtorch on ROCm
* Use manylinux2_28 docker images for ROCm
* Set `DESIRED_DEVTOOLSET=cxx-abi` (and hence `_GLIBCXX_USE_CXX11_ABI=1`) for ROCm manylinux2_28 wheels (ROCm RHEL8 packages also have GCC_ABI=1, so it keeps it consistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141423
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-12-04 07:00:25 +00:00
c0e1fc4919 Avoid casting low precision inputs to high precision for XPU Tensor in torch.linalg.vector_norm (#141954)
Fixes https://github.com/pytorch/pytorch/issues/141953

For mixed-precision cases, tensors on the CPU device are cast to `out_dtype`, while tensors on CUDA devices are not, for computational efficiency. For Intel XPU tensors, low-precision inputs should likewise not be converted to high precision (same as CUDA).
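
A minimal sketch of the affected call (assuming an Intel XPU device is available; the sizes and dtypes are illustrative):
```python
import torch

x = torch.randn(1 << 20, dtype=torch.float16, device="xpu")
# After this change, the fp16 input is reduced in low precision (as on CUDA)
# instead of being upcast to the higher-precision output dtype first.
n = torch.linalg.vector_norm(x, ord=2, dtype=torch.float32)
print(n)
```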

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141954
Approved by: https://github.com/guangyey, https://github.com/ezyang
2024-12-04 06:44:19 +00:00
75d57b04ec [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [9/N] (#140933)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933

> This is the last one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140933
Approved by: https://github.com/ezyang
2024-12-04 06:28:08 +00:00
d6481333ad [MPS] Add scatter_reduce.two (#141948)
This has been requested 20+ times on https://github.com/pytorch/pytorch/issues/77764 and is just a flavor of the out-of-the-box scatter-reduce, so all this op does is redispatch to the existing implementation.
Unsupported dtype/reduction type combinations:
 - min/max for int64
 - min/max for int32 on MacOS-14 or older

The following Swift code demonstrates the problem with the scatterAlongAxis MPS call:
```swift
import Metal
import MetalPerformanceShadersGraph

func scatterMPS(device: MTLDevice,
                inp_buf: MTLBuffer, upd_buf: MTLBuffer,
                idx_buf: MTLBuffer, out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .int64, name: nil)
  let updatesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.scatterAlongAxis(0, data: inputPlaceholder, updates: updatesPlaceholder, indices: indicesPlaceholder, mode: .min, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  let mpsUpdatesBuffer = MPSGraphTensorData(upd_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               updatesPlaceholder: mpsUpdatesBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues(device: MTLDevice, values: [Int64]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: Int64.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
print("Using device \(device.name)")

let inp_elem = 4
let upd_elem = 4
let inp_buf = makeBufferWithValues(device: device, values: [1, 2, 3, 4])
let upd_buf = makeBufferWithValues(device: device, values: [Int64.max - 1, Int64.max - 2 , Int64.max >> 16 , 11])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:inp_elem * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

scatterMPS(device: device,
           inp_buf: inp_buf, upd_buf: upd_buf,
           idx_buf: idx_buf, out_buf: out_buf,
           inp_elem: inp_elem, upd_elem: upd_elem)

let obuf_data = out_buf.contents().assumingMemoryBound(to: Int64.self)
for i in 0..<inp_elem {
    print("out_buf[\(i)] = \(obuf_data[i])")
}
```
which prints `4294967294, 4294967293, 4294967295, 4` instead of the expected `1, 2, 3, 4`,
whereas `torch.tensor([[1, 9223372036854775806], [2, 9223372036854775805], [3, 140737488355327], [4, 11]], dtype=torch.int64, device='mps').max(1)` yields the expected results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141948
Approved by: https://github.com/manuelcandales
2024-12-04 04:56:43 +00:00
deffbbdb91 Update submodule ideep for pd cache changes (#141555)
Fixes https://github.com/pytorch/pytorch/issues/141327.
Fixes https://github.com/pytorch/pytorch/issues/141328.
Fixes https://github.com/pytorch/pytorch/issues/141329.
Fixes https://github.com/pytorch/pytorch/issues/141330.
Fixes https://github.com/pytorch/pytorch/issues/141331.

Summary:
1. Modify to_bytes function to include binary_src shape information into the keys of pd cache.
2. Modify inner_product_forward to support broadcast add fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141555
Approved by: https://github.com/jgong5
2024-12-04 04:55:33 +00:00
e8e65764d1 [pipelining] Improve schedule csv loading (#142009)
Add small changes based on feedback from Less when testing out https://github.com/pytorch/torchtitan/pull/707
- expose `validate_schedule` as a function
- handle spaces around actions in csv file
- add error arrow to `_format_pipeline_schedule()` to better show where the step errored

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142009
Approved by: https://github.com/lessw2020
2024-12-04 04:15:34 +00:00
86f306b15e _inductor: Add dynamo_timed for async_compile.precompile and turn on waitcounters (#141920)

This fixes some review comments from https://github.com/pytorch/pytorch/pull/141379
and gives us another dynamo_timed event for local compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141920
Approved by: https://github.com/masnesral
2024-12-04 04:03:46 +00:00
30d907c6fb When serializing treespec context, support enum as well (#141525)
Following https://github.com/pytorch/pytorch/pull/102716, per @angelayi's suggestion.

Note that in general enum as an input is not supported.

repro:
```
class TestEnum(enum.Enum):
    A = auto()
    B = auto()

    @staticmethod
    def from_string(s):
        return TestEnum[s.upper()]

class M(torch.nn.Module):
    def forward(self, x, en):
        return x.clone()

input1 = (
    torch.rand(10, device="cuda"),
    {TestEnum.A: torch.rand(10, device="cuda")},
)
inputs = [input1]
model = M().cuda()

_ = model(*input1)

ep = torch.export.export(model, input1, strict=False)
path = torch._inductor.aot_compile(ep.module(), input1)
```

Differential Revision: D66269157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141525
Approved by: https://github.com/angelayi
2024-12-04 03:08:50 +00:00
288b73cb14 [Redo] Set remote cache version and backend type once in compilation metrics (#141967)
(Got reverted due to a silly bug, fixed now.)

This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times

Differential Revision: [D66701970](https://our.internmc.facebook.com/intern/diff/D66701970/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141967
Approved by: https://github.com/laithsakka
2024-12-04 03:07:53 +00:00
7dfb439a2a Only write predicate once when there are multiple torch.cond (#141528)
Fixes #140606

TEST PLAN:

```
python test/inductor/test_aot_inductor.py -k cond_share
python test/inductor/test_aot_inductor_arrayref.py -k cond_share
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141528
Approved by: https://github.com/desertfire
2024-12-04 01:56:10 +00:00
cyy
bffaddf9ea Format caffe2/serialize (#141850)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141850
Approved by: https://github.com/cpuhrsch
2024-12-04 01:14:24 +00:00
941da90e8a Add macos perf run to the dashboard upload (#141999)
Adjust the inductor workflow to ensure the macOS perf run gets uploaded
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141999
Approved by: https://github.com/huydhn
2024-12-04 01:08:13 +00:00
291626fb22 [ROCm] port CK rowwise F8 from fbgemm (#140856)
This ports (copies) FBGEMM's implementation from @jwfromm.

https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140856
Approved by: https://github.com/drisspg, https://github.com/atalman
2024-12-04 00:32:24 +00:00
a51a048027 [AOTI][refactor] Move stack allocation related configs (#139093)
Summary: Move allow_stack_allocation and use_minimal_arrayref_interface configs into the aot_inductor subclass.

Test Plan: CI

Differential Revision: D65064301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139093
Approved by: https://github.com/chenyang78
2024-12-04 00:15:19 +00:00
0190d929f2 [BE] Remove unused argument (#141983)
Summary: As title, the `node_filter` argument is not used.

Test Plan: CI

Differential Revision: D66712599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141983
Approved by: https://github.com/tugsbayasgalan
2024-12-04 00:07:33 +00:00
9286c21b22 Fix fbcode tests for automatic dynamic unspecialize float (#141975)
Differential Revision: D66708552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141975
Approved by: https://github.com/bdhirsh, https://github.com/atalman
2024-12-03 23:59:06 +00:00
20912ba582 fix incorrect c10::SymFloat::sqrt (#141728)
Fixes the silent correctness for SDPA in https://github.com/pytorch/pytorch/issues/141710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141728
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #141725
2024-12-03 23:34:16 +00:00
af3e7389ef guard on flash attention SymFloat scale instead of incorrectly casting to float (#141725)
Fixes https://github.com/pytorch/pytorch/issues/141710. Previously, if we called flash attention with a `SymFloat` scale that was properly symbolic, we would unsafely cast its raw `SymFloat._data` into a `float`, which is pretty much guaranteed to give `NaN`.

This avoids the NaNs in the linked issue, but I'm not sure if it's worth landing yet because we'll start specializing and recompiling for every distinct `scale` that is passed in (which in the dynamic shapes case, is some function of `query.size(-1)`).

The real fix would be to ensure that the flash attention (and related) ops all accept a symbolic version of the `scale`. I'm not sure if we should use `SymFloat` or `Scalar` though - more discussion in the issue
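
For context, a minimal sketch of the pattern that produces a symbolic scale (the shapes, device, and dtype here are illustrative and require a GPU):
```python
import torch
import torch.nn.functional as F

@torch.compile(dynamic=True)
def attn(q, k, v):
    # Under dynamic shapes, q.size(-1) is symbolic, so this scale is a SymFloat
    # rather than a plain Python float.
    scale = 1.0 / (q.size(-1) ** 0.5)
    return F.scaled_dot_product_attention(q, k, v, scale=scale)

q = k = v = torch.randn(2, 4, 128, 64, device="cuda", dtype=torch.float16)
out = attn(q, k, v)
```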

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141725
Approved by: https://github.com/ezyang
2024-12-03 23:34:16 +00:00
da5b281f23 Generate op variants for core CIA ops (#141797)
There are four core ATen ops with Composite Implicit Autograd (CIA) dispatch: upsample_bilinear2d.vec, upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Op variant auto-generation is currently skipped for CIA ops. In preparation to disable the decompositions for upsample ops by default in export, we need to generate out variants for these ops.

This change enables autogen for core-tagged CIA ops, which enables generation of upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out.

Test Plan:
Added a new test test_functional_variant_autogen_out_variant_core to cover this case in test_codegen.py.
Confirmed that upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out op overloads are registered (they were previously not available).

Differential Revision: D66590257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141797
Approved by: https://github.com/larryliu0820
2024-12-03 22:57:46 +00:00
f0b33658f8 Dont use constant mask if ynumel potentially overflows ygrids (#139751)
If (ynumel / YBLOCK)  > get_max_ygrids(), the z dimension will be used if znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, and so this needs to be considered when determining if the y dimension has a constant mask.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751
Approved by: https://github.com/eellison

Co-authored-by: George White <georgew@graphcore.ai>
2024-12-03 22:56:18 +00:00
cc98a1b599 _inductor: Add WaitCounter for triton.compile calls. (#141379)
_inductor: Add WaitCounter for async_compile.wait calls.

This will start recording how long these async_compile.wait calls take.

Note that we want to just unify dynamo_timed in the long term.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141379
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-12-03 22:56:04 +00:00
f86a1753d1 Add option to split Linear gates for Quantizable LSTM into separate ops (#141366)

Summary:

Reattempt to land D65283170, adding pyre-fixmes / mypy ignores following D52890934

For LSTM, the input and hidden state are projected with Linear layers to construct the 4 gates. This is typically performed together as a single Linear (for each state) with output channel count `4 * hidden_dim` for efficiency.
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=52-58
The output is then ultimately split into 4:
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=83-87

For on-device latency (and possibly memory) considerations, we want to avoid constructing the intermediate `gates` tensor (which can be relatively large), by splitting `igates` and `hgates` first (as 4x `Linear(hidden_dim, hidden_dim)` each), applying add separately, then proceeding as usual.

This functionality can be enabled by specifying `split_gates=True` (default False is original behavior) at any entry point (directly with `torch.ao.nn.quantizable.LSTM`  or via `_get_lstm_with_individually_observed_parts`).
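
As a rough sketch of the difference in plain nn.Module terms (not the actual torch.ao.nn.quantizable implementation; the dimensions are illustrative):
```python
import torch
import torch.nn as nn

input_dim, hidden_dim, batch = 32, 64, 8
x = torch.randn(batch, input_dim)
h = torch.randn(batch, hidden_dim)

# Fused form: one Linear per state with 4 * hidden_dim outputs, then chunk.
ih = nn.Linear(input_dim, 4 * hidden_dim)
hh = nn.Linear(hidden_dim, 4 * hidden_dim)
gates = ih(x) + hh(h)                      # large (batch, 4 * hidden_dim) intermediate
i, f, g, o = gates.chunk(4, dim=1)

# Split form (split_gates=True): four small Linears per state, added per gate,
# so the (batch, 4 * hidden_dim) intermediate is never materialized.
ih_split = nn.ModuleList(nn.Linear(input_dim, hidden_dim) for _ in range(4))
hh_split = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(4))
i2, f2, g2, o2 = (a(x) + b(h) for a, b in zip(ih_split, hh_split))
```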

Test Plan:
piggy back on existing test to check for correct swap handling, numerics, and jit.script during prepare/convert
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_custom_module_lstm (caffe2.test.quantization.core.test_quantized_op.TestQuantizedOps)'
```
https://www.internalfb.com/intern/testinfra/testrun/4503599884152725

This test is quite long running now (more than double original).

---

shorter test to confirm original `LSTMCell` passes
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_static_lstm_with_custom_fixed_qparams (quantization.fx.test_quantize_fx.TestQuantizeFx)'
```
https://www.internalfb.com/intern/testinfra/testrun/11258999127933996

Reviewed By: Ninja91

Differential Revision: D66380336
2024-12-03 17:21:44 -05:00
80705d3abf Convert assert to torch._check in MHA (#141918)
Fixes https://github.com/pytorch/pytorch/issues/139610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141918
Approved by: https://github.com/ezyang
2024-12-03 21:58:02 +00:00
5303af2d27 Structured compile_fx (#141505)
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation.
- Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior.
- Add an async interface. (See below)

The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that.  An async function is just a syntactic way to return an asyncio.Future.

Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141505
Approved by: https://github.com/ezyang
ghstack dependencies: #141502
2024-12-03 21:27:32 +00:00
02147fe0f9 codecache: pull out some Graph serialization code into common helpers (#141502)
Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141502
Approved by: https://github.com/ezyang
2024-12-03 21:27:32 +00:00
8e9873d0a3 Allow attribute mutation for MutableMappingVariable (#141376)
Fixes #141375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141376
Approved by: https://github.com/vmoens
2024-12-03 21:00:10 +00:00
b4ea913978 Check /var/lib/jenkins/workspace exists before setting permissions (#141767)
Currently, if you run these CI scripts in a non-jenkins environment, they fail due to the folder not existing. This ensures the CI scripts can be re-used in different runners.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141767
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-03 20:56:20 +00:00
cyy
7c1d5db1f3 [2/N] Enable UBSAN tests (#141740)
Apply c10::load in more places. The function was introduced to cast a byte to valid boolean values, thus fixing the UBSAN errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141740
Approved by: https://github.com/ezyang
2024-12-03 20:52:26 +00:00
28efc17d2c [pytorch/profiler] Honor escape quotes arg in a profiler metadata log formatter (#141527) (#141626)
Summary:

We were ignoring the with_escaped_quotes param in the format_list inline function in utils.cpp in the case where we had to truncate a list of more than kTruncatelength items.

In that case we would truncate the list into a string but always return it wrapped in escaped quotes. This causes issues if the string is then added to other lists, which also go through formatting, leading to cases like `"["[a, b, c, ...]"]"`.

Now the above will be correctly formatted as `"[[a, b, c, ...]]"`, since the escape-quote request is honored.

Differential Revision: D66521676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141626
Approved by: https://github.com/sraikund16
2024-12-03 20:42:57 +00:00
78e53a92c3 Remove monkeypatch of has_frozen_params in test/inductor/test_codecache.py (#141898)
Summary: This particular test isn't really needed since the code path is already exercised in `test_freezing`. While I was here, I beefed up testing in that method to consider whether the frozen parameter is inlinable or not, since the caching behavior is different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141898
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-12-03 20:38:10 +00:00
42547f8d48 Add support for blackwell codegen (#141724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141724
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
2024-12-03 20:34:43 +00:00
8b0fcad0fd [AOTInductor] Add update_constant_buffer pybind support (#140755)
Summary: We add update_constant_buffer python support for testing purpose.

Test Plan: Included in commit

Differential Revision: D65968613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140755
Approved by: https://github.com/22quinn
2024-12-03 20:34:25 +00:00
e5f5283ab2 Fix cuda arch full version for 12.6 (#141976)
Follow-up to https://github.com/pytorch/pytorch/pull/141433/files:
the build is still showing up as 12.6.2 in the name; see the latest https://github.com/pytorch/pytorch/actions/runs/12134985224/job/33833276884.

related to https://github.com/pytorch/pytorch/issues/138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141976
Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/Skylion007
2024-12-03 20:33:01 +00:00
f472b3aee1 improve typings around torch.export (#141829)
This is another follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `torch._export`. Even though the PR introduces a few runtime type asserts, the runtime behavior should stay equivalent, because the failed assertions should have been immediate crashes anyway.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141829
Approved by: https://github.com/ezyang
2024-12-03 19:57:21 +00:00
43c5f59190 flip capture_autograd_function to default to true and warn if false (#141972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141972
Approved by: https://github.com/zou3519
ghstack dependencies: #141932
2024-12-03 19:50:14 +00:00
96a35716d1 [aoti] Improve OSSProxyExecutor error messages (#141501)
For debugging issues like https://fb.workplace.com/groups/1028545332188949/permalink/1092584242451724/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141501
Approved by: https://github.com/henrylhtsang
2024-12-03 19:32:49 +00:00
6b620423a3 dynamo_timed: Add a log_waitcounter option. (#141402)
This logs a waitcounter of the name pytorch.dynamo_timed.{key}.

Primarily sending this now to make sure everyone likes the API, then
I'll add tests, and migrate one dynamo_timed to use it. (likely starting
with
https://github.com/pytorch/pytorch/pull/141379).

Testing is a bit harder, since we don't normally have any way to read
_WaitCounter state AFAICT. I want to poke around and see if I can figure
out a way to read the state, otherwise I'll just mock it to at least
make sure it's mostly working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141402
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-03 19:24:29 +00:00
d35358b271 [FlexAttention] Remove failing num_warps=8 in bwds (#141653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141653
Approved by: https://github.com/BoyuanFeng
2024-12-03 19:22:52 +00:00
9125e9119c Fix memory leak in ModuleTracker (#141960)
Thanks @drisspg and @albanD for finding the fix

**TEST PLAN**
```
import gc
import torch
import torch.nn as nn
from torch.utils.module_tracker import ModuleTracker

class MyModel(nn.Module):
    def forward(self, x):
        return x * x

print(f"torch=={torch.__version__}")
m = MyModel()
m.cuda()
m.to(torch.bfloat16)
mt = ModuleTracker()
for i in range(1000):
    if i % 100 == 0:
        gc.collect()
        print("memory_allocated:", torch.cuda.memory_allocated())
    x = torch.randn([128, 256], device="cuda", dtype=torch.bfloat16, requires_grad=True)
    with mt:
        m(x)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141960
Approved by: https://github.com/albanD
2024-12-03 18:36:15 +00:00
7bb2228ffd Test cpp_wrapper_hipify string comparison (#141353)
Updating the test to match this code that takes device warpsize into account: cf1d95a965/torch/_inductor/codegen/cuda/device_op_overrides.py (L120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141353
Approved by: https://github.com/desertfire
2024-12-03 18:25:32 +00:00
8b5c26287d Initialize lr as a tensor if it is originally a tensor (#141620)
Fix https://github.com/pytorch/pytorch/issues/139575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141620
Approved by: https://github.com/kwen2501
2024-12-03 18:10:23 +00:00
314e08eb52 [fr_trace][bugfix] Log missing ranks when provided (#141924)
Summary: For missing ranks issues, `build_collectives` doesn't log any errors (5c2584a14c/tools/flight_recorder/components/builder.py (L293C23-L306C24)), which means that when `EntryState.to_collective` is called [here](5c2584a14c/tools/flight_recorder/components/builder.py (L400C21-L405C22)), errors will be empty and `to_collective` will enter the first if statement. But that codepath doesn't log `missing_ranks`, meaning it will be absent from the `Collective` returned. This diff fixes that oversight.

Test Plan:
eyes

Sandcastle run

Differential Revision: D66679224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141924
Approved by: https://github.com/c-p-i-o
2024-12-03 17:54:43 +00:00
5c59f4a55a Remove old FSDP1 fully_shard (#141875)
FSDP1's `fully_shard` frontend was an exploration at the end of 2022 H2 as part of the `torch/distributed/_composable` APIs to avoid `nn.Module` wrappers. It calls into the same backend code as FSDP1's `FullyShardedDataParallel`.

The API did not gain traction internally, so we instead reused the name `fully_shard` for FSDP2, which similarly is not an `nn.Module` wrapper and follows similar design principles as FSDP1's `fully_shard`.

To the best of our knowledge, we have removed all instances of FSDP1's `fully_shard` internally, and we put the deprecation warning in open source in 2.4 saying it will be removed after 2.5 (which is now):
4959784dac/torch/distributed/_composable/fully_shard.py (L40-L48)

We are skipping the PR sanity check because this PR is only removing code, not adding new code, and should not require this sanity check.

Differential Revision: [D66664988](https://our.internmc.facebook.com/intern/diff/D66664988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141875
Approved by: https://github.com/weifengpy
2024-12-03 17:00:47 +00:00
ed4831b93c Improve torch.library.opcheck and register_autograd docs (#141883)
Fixes https://github.com/pytorch/pytorch/issues/141618
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141883
Approved by: https://github.com/albanD
ghstack dependencies: #141894, #141880
2024-12-03 16:28:56 +00:00
827c322290 Make torch.library.triton_op public (#141880)
We've been using it privately for half a year and everything's been
good. This PR:
1. Makes torch.library.triton_op public
2. Renames capture_triton -> wrap_triton. We got feedback that no one
   knew what "capture triton" does.
3. Makes torch.library.wrap_triton public.

triton_op is used to construct a Python custom operator that may call 1+
triton kernels. Each of those triton kernels must be annotated with
wrap_triton.
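
A minimal sketch of how the two fit together (the `mylib::add` namespace and the kernel are hypothetical; requires a CUDA device with Triton installed):
```python
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

@triton_op("mylib::add", mutates_args=())
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # Each triton kernel called inside a triton_op is wrapped with wrap_triton.
    wrap_triton(add_kernel)[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(4096, device="cuda")
print(torch.allclose(add(x, x), x + x))
```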

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141880
Approved by: https://github.com/albanD
ghstack dependencies: #141894
2024-12-03 16:28:56 +00:00
ac600fdce6 Type exposed_in decorator (#141894)
Test Plan:
- lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141894
Approved by: https://github.com/albanD
2024-12-03 16:28:17 +00:00
7a806a839d [FP8] Expand MaskedSelect to float8 (#141928)
Needed for printing those.
Though I wonder if a better solution would be to change those ops to use element size rather than the actual type (to extend them automatically to unsigned integral types as well).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141928
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-12-03 15:14:26 +00:00
78543e6002 [dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137397
Approved by: https://github.com/jansel
2024-12-03 11:17:39 +00:00
9990b47ea3 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-03 09:26:43 +00:00
ff73e2e679 [dynamo] Validate mutation_type and source in VariableTracker.__init__ (#141717)
As title, this also uncovered a few invalid use cases; the cases that
cause error are fixed in separate patches prior to this patch, and the
rest are fixed in this patch.

This patch also moves a few `.source` mutation to variable construction,
to increase the coverage of the validation.

Fixes #133027.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141717
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902, #141716
2024-12-03 09:18:06 +00:00
0efd184685 [dynamo] Fix side effects for range iterator that escapes the graph (#141716)
`wrap_range_iterator` mistakenly used `ValueMutationNew`, when it
should've used `ValueMutationExisting`, because this code path always
has a source.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141716
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902
2024-12-03 09:18:06 +00:00
7c3c8a662e [dynamo] Add RANGE_ITERATOR_MATCH to properly guard on range iterators (#141902)
A subsequent patch attempts to fix a side-effect issue for range
iterators, which in turn exposed an existing issue with guards for range
iterators -- the following test started failing:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_hstack_column_stack_cpu_int16
```

This patch adds a `RANGE_ITERATOR_MATCH` guard to make sure that we
properly guard on range iterators, and adds a regression test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141902
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715
2024-12-03 09:18:06 +00:00
ff3f4a164c [dynamo] Fix aliasing issue for dict.copy that escapes the graph (#141715)
Dynamo accidentally passed the original `ConstDictVariable.source` to
the result of `dict.copy(...)`, which caused aliasing issue when the
result escapes the graph (e.g., is a return value).

This patch fixes that and adds a regression test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141715
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714
2024-12-03 09:18:06 +00:00
9eb0520d75 [dynamo] Fix side-effect handling for pre-existing collections.deque (#141714)
Previously we never replayed side effects to `DequeVariable` with a
source; the bug was already in the `test_deque_input` test, but went
unnoticed because we didn't check the deque objects.

This patch adds limited but practical support for this (see comments in
`side_effects.py` for why limited), and updates the deque tests to check
for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141714
Approved by: https://github.com/jansel
ghstack dependencies: #141713
2024-12-03 09:18:06 +00:00
f2ce2d435b [dynamo] Add test for returning a nested recursive function and update documentation (#141713)
Addresses https://github.com/pytorch/pytorch/pull/137905#discussion_r1806923085.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141713
Approved by: https://github.com/jansel
2024-12-03 09:18:06 +00:00
f8a64c324e Broadcast constants on vectorised stores in CppTile2DKernel (#140262)
Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following:
```shell
error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
   61 |                                 tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
      |                                           ^~~~~
```
This PR adds the required broadcasting.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
2024-12-03 09:15:17 +00:00
e1e3bbc2e1 Set capture_autograd_function=False by default (#141932)
https://github.com/pytorch/pytorch/pull/136959 cleaned up the flag and added a warning. @Chillee pointed out that we should really default this flag to false otherwise we subject all users that go down this code path to log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141932
Approved by: https://github.com/jansel
2024-12-03 07:59:03 +00:00
e499b46465 Speed up half tensors printing (#141927)
This PR removes the copy-cast of reduced-precision types to float before printing, which was added in https://github.com/pytorch/pytorch/pull/14418, probably to unblock printing at a time when many operations, like `is_nan` and `max`, were not supported for these types on CPU.

(Reusing old test plan) Before the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

after the PR
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Also, this allows printing a 15GB Metal tensor on a 32GB Mac machine:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)
```

Before this change it failed with a non-descriptive error:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB
```

Convert fp8 dtypes to float16, as the float range is overkill.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141927
Approved by: https://github.com/ezyang
2024-12-03 07:01:49 +00:00
d035db3d86 [AMD] [submodule] aten.bmm CK-backend prototype (#140758)
Summary:
Early prototype of adding CK backend for aten.bmm. Currently, it is very limited in that:

1. BF16 only
2. A single CK instance
3. NT layout only
4. Alpha=1, Beta=0 only

Reviewed By: xw285cornell, zjing14

Differential Revision: D65954695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140758
Approved by: https://github.com/bradleyhd
2024-12-03 06:54:51 +00:00
6afcec0c58 Assert is GraphModule in compile_fx_aot (#141575)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141575
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-12-03 05:39:44 +00:00
ce86119503 Revert "Set remote cache version and backend type once in compilation metrics (#141707)"
This reverts commit d633cf1f55f87e5536f63981357d543ac46e48f7.

Reverted https://github.com/pytorch/pytorch/pull/141707 on behalf of https://github.com/malfet due to It breaks tests by referencing FbRemoteFxGraphCache, but CI was green ([comment](https://github.com/pytorch/pytorch/pull/141707#issuecomment-2513555185))
2024-12-03 05:01:02 +00:00
2999dbfd21 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 3ab4a28eaa7dc67d5c46c2016bbfe9932b36de06.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Job are failing en masse after this lands, so it looks like a land race ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2513552752))
2024-12-03 04:57:58 +00:00
38bbe37187 Enable CI on SM89 (#140305)
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376

To enable more balanced sharding, had to push 148ae19935

Added `@xfailIfSM89` to the following tests:
 - test_fp8_pattern_2
 - test_original_aten_preserved_split_addmm
 - test_sparse_semi_structured_scaled_mm
 - test_sparse_semi_structured_scaled_mm_fp8
 - test_sparse_fp8fp8_mm

Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`

Skipped following inductor tests (that either flaky OOMs or timeouts):
 - test_reduction_fn_std_float64
 - test_reduction_fn_var_mean_float64
 - test_multi_output_unbacked_custom_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
2024-12-03 04:49:46 +00:00
af88326250 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-03 04:45:05 +00:00
9cfc9e636d [while_loop] change to guard_equals for checking output and carry (#141734)
Inputs with the same shape can be represented with different symbols, e.g.
```python
def body_fn(a, b):
  return b.sin(), a.sin()
```
, where a = torch.randn(3, 4) and b = torch.randn(3, 4). There could be 4 symbols allocated for a and b. So instead of checking that their shape and stride symbols are identical, we just use guard_equals to enforce the constraint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141734
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-12-03 04:00:21 +00:00
871b93bc59 [associative_scan] Fixing shape checks (#141698)
This PR fixes the shape checks that are done in the associative_scan operation.
Before, all shapes of the input leaves were required to be the same. With this PR, only the shapes of the combine_fn outputs and the corresponding input leaves need to be the same; the input leaves no longer need to match each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141698
Approved by: https://github.com/ydwu4
2024-12-03 03:49:11 +00:00
3ab4a28eaa [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

Co-authored-by: James Wu <jjwu@meta.com>
2024-12-03 03:48:23 +00:00
ecbb8a8800 Mention version of flip in weights_only error message (#141304)
Fixes https://github.com/pytorch/pytorch/issues/141139

How the 3 versions of the error message now look

### Version 1

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
### Version 2

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
```

New error message:
```

_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
 In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.

```

### Version 3

Old error message
```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
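
For reference, the allowlisting flow the message recommends looks roughly like this (`mypackage.MyClass` and `checkpoint.pt` are hypothetical):
```python
import torch
from torch.serialization import add_safe_globals, safe_globals
from mypackage import MyClass  # hypothetical class referenced by the checkpoint

# Option 1: allowlist globally, then load with weights_only=True
add_safe_globals([MyClass])
state = torch.load("checkpoint.pt", weights_only=True)

# Option 2: allowlist only within a scope
with safe_globals([MyClass]):
    state = torch.load("checkpoint.pt", weights_only=True)
```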

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141304
Approved by: https://github.com/zou3519
2024-12-03 03:26:27 +00:00
4cbb3b4bd2 [ROCm] Enable finding HIP and ROCm libraries on Windows (#137279)
This PR introduces support for finding HIP-SDK Libraries on Windows.

Since reading the code changes in the diff view is a bit cumbersome due to the introduced if branch, let me explain what was changed:
- The Linux-specific steps to find HIP packages have been moved into the `if(UNIX)` block
- Windows steps follow in the `else()` clause

The separation was needed, because of several factors:
- HIP SDK for Windows typically names its components using `hip` in their names (for exmaple: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.),
- The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx)
- MIOpen isn't a part of HIP SDK, but can be built separately and as of now requires an additional path to be defined using an env var.
- Windows can only find the hip package (in a version greater than 1.0) and its libraries if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name will cause the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes), which is the only one supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case, module-mode-searched `find_package(HIP)` is used later for inclusion of macros such as `hip_add_library` and related macros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279
Approved by: https://github.com/jeffdaily
2024-12-03 03:26:01 +00:00
33573488d0 Make Dtypepropagation singleton (#141882)
Should fix the compile-time regression: it was doing fairly expensive metaprogramming in init and being instantiated multiple times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141882
Approved by: https://github.com/ezyang
ghstack dependencies: #139945, #140057, #141495
2024-12-03 03:15:16 +00:00
f911361de1 Correctly specify size of sparse_csr tensors in maskedtensor binary ops (#134335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134335
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-12-03 02:55:57 +00:00
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. Should hopefully reduce linting time. Has support for orjson cache serialization, which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
34127fc688 Only reconstruct dict if needed (#141606)
Fixes #141452

This is a follow-up of PR #134876, which optimized dict reconstruct to codegen only if any value changed. In this PR we cover the general case and do not codegen any instruction if the dictionary remains the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141606
Approved by: https://github.com/zou3519
2024-12-03 02:22:34 +00:00
a6bea3d86d Fix DCe in training IR to reflect correct record function op (#141899)
Summary: The exit function is actually exit._recordFunction not exit.default

Test Plan: CI

Differential Revision: D66665359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141899
Approved by: https://github.com/ydwu4
2024-12-03 01:59:37 +00:00
d633cf1f55 Set remote cache version and backend type once in compilation metrics (#141707)
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times.

(Same problem, different fix of D66502138)

Differential Revision: [D66508492](https://our.internmc.facebook.com/intern/diff/D66508492/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141707
Approved by: https://github.com/ezyang
2024-12-03 01:49:11 +00:00
77748ed8ec fix c10::Event UT failure on XPU backend (#141800)
# Motivation
Fix this UT failure introduced by https://github.com/pytorch/pytorch/pull/140865. An unrelated failure had been suppressing this UT failure.
It started to happen once https://github.com/pytorch/pytorch/pull/141546 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141800
Approved by: https://github.com/EikanWang
2024-12-03 01:34:42 +00:00
09ce760fef Revert "Add missing data types at torch export serialization (#138561)"
This reverts commit 1ef1b3b39123255483c51fafbd21217d76e140e7.

Reverted https://github.com/pytorch/pytorch/pull/138561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138561#issuecomment-2513343401))
2024-12-03 01:32:50 +00:00
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
5c33c9202f Skip test_cpu_repo.py::CPUReproTests::test_auto_zvec_vsx_simd on AArch64 (#141155)
The skipping logic clearly states it shouldn't be running on this architecture. The test then fails due to `VecNEON` returning `128` from `bit_width()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141155
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet
2024-12-03 00:19:06 +00:00
c17ba69ba5 [submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936) (#141901)
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590

Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-12-03 00:16:35 +00:00
e41a0b33ec Allow Fakified subclass to have different device for inner and outer tensor (#141839)
Previously if a wrapper tensor subclass is fakified, the inner tensors would end up having the same device as the outer tensor. This PR makes it so that inner and outer tensors can have different devices.

See OffloadTensor PR https://github.com/pytorch/pytorch/pull/141840/files#diff-3bc0cf540b694f4ec0a3749f78b047456657a53a5657e495ffb68e5970c5fdaaR1955 for an application. A simpler test has been added in this PR.

This is technically bc-breaking because now the callback passed to MetaConverter needs to accept an extra argument, but no one external should be using this anyway?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141839
Approved by: https://github.com/bdhirsh
ghstack dependencies: #141166
2024-12-03 00:09:41 +00:00
9830e7b1e4 Update OpenBLAS to 0.3.28 (#137263)
This includes a number of performance improvements, such as threading optimisations and forwarding GEMM calls to GEMV for calls where N=1 or M=1.

See: https://github.com/OpenMathLib/OpenBLAS/releases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137263
Approved by: https://github.com/malfet
2024-12-03 00:05:34 +00:00
9f9105a67b [MPS] Write/Invoke Metal shaders from C++ (#141547)
By introducing `DynamicMetalShaderLibrary` and `MetalShaderFunction`
Add unit tests that also serve as an example of how the API works

Using this primitive, one can compile and dispatch any 1D or 2D shader over an MPS tensor using the following pattern
```cpp
auto x = torch::empty({8, 16}, at::device(at::kMPS));
DynamicMetalShaderLibrary lib(R"MTL(
  kernel void full(device float* t, constant ulong2& strides, uint2 idx [[thread_position_in_grid]]) {
    t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y;
  }
)MTL");
auto func = lib.getKernelFunction("full");
func->runCommandBlock([&] {
   func->startEncoding();
   func->setArg(0, x);
   func->setArg(1, x.strides());
   func->dispatch({8, 16});
});

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141547
Approved by: https://github.com/Skylion007
2024-12-02 23:57:59 +00:00
5c2584a14c [ROCm] Enable inductor GEMM lowering for gfx11 (#141687)
This check doesn't make sense for some AMD GPUs: they have the right number of CUs, but multi_processor_count returns WGPs on RDNA while the hardware still performs adequately. A lot of tests fail on modern archs because this check defaults them to not using the GEMM backend.
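
For context, a small sketch of how the property this heuristic reads looks from Python (device index and values are illustrative):

```python
import torch

props = torch.cuda.get_device_properties(0)
print(props.multi_processor_count)  # CUs on most GPUs, WGPs on RDNA under ROCm
```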

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141687
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-02 22:13:34 +00:00
1f3d8896bc Fix mismatched tensor metadata between FakeTensor and Intel XPU concrete tensor when running F.logsigmoid (#141333)
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` will return two outputs: `output` and `buffer`.
For the `F.logsigmoid` CPU path, `buffer` is used to store some intermediate values needed when computing gradients, so it returns a `buffer` tensor with nonzero size. For the CUDA and XPU paths, `buffer` is unused, so the `buffer` tensor size of XPU `F.logsigmoid` will be zero, just like CUDA. The root cause of the issue is that the code in `decompositions.py` (ref: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handles the CUDA case; when a fake tensor on an XPU device reaches it, it takes the CPU path and returns a `buffer` with nonzero size, which conflicts with the implementation of the Intel XPU concrete tensor. Therefore this PR adds conditions to handle the XPU case and makes sure the two returned buffer sizes match each other.
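
A paraphrased sketch of the shape of the fix (not the exact decomposition code), with the device condition extended to XPU:

```python
import torch

def log_sigmoid_forward_sketch(self: torch.Tensor):
    z = torch.exp(-torch.abs(self))
    if self.device.type in ("cuda", "xpu"):  # previously only CUDA took this branch
        buffer = self.new_zeros((0,))        # buffer is unused on these backends
    else:
        buffer = z                           # CPU keeps intermediates for the backward pass
    output = torch.minimum(self.new_zeros(()), self) - torch.log1p(z)
    return output, buffer
```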

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-12-02 22:09:20 +00:00
74eb92ed6e fix deep copy of empty graph (#141660)
Differential Revision: D66532131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141660
Approved by: https://github.com/ezyang
2024-12-02 22:03:13 +00:00
41e59754b4 [CI] Remove inductor-perf-test-nightly-a10g.yml (#141895)
Summary: Deprecate the A10g nightly perf run. The workflow was introduced as an experiment and doesn't seem to be used by developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141895
Approved by: https://github.com/huydhn
2024-12-02 21:55:20 +00:00
cyy
55250b324d [1/N] Apply py39 ruff fixes (#138578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138578
Approved by: https://github.com/Skylion007
2024-12-02 21:46:18 +00:00
b47bdb06d8 Revert "[inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)"
This reverts commit 942a2438e263a2632b8934dd245060c9b237f4be.

Reverted https://github.com/pytorch/pytorch/pull/141334 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141334#issuecomment-2512891840))
2024-12-02 21:29:02 +00:00
6b05e31042 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 61534391ba8204286f5c9ed15ab636e94bd3daf2.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a lot of failures shows up after this lands ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2512890426))
2024-12-02 21:26:13 +00:00
64d44a39a1 remote_cache: Add a waitcounter for gets and sets (#141307)
This adds a basic waitcounter to help show if we're spending a lot of
time doing gets and sets to remote caches

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141307
Approved by: https://github.com/masnesral
2024-12-02 20:48:47 +00:00
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af2ce50560fa5a74473665ea229e6c9d.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
54adbbf6b8 cpp_wrapper: Add support for MemoryFormat arguments (#141367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367
Approved by: https://github.com/desertfire
2024-12-02 20:40:24 +00:00
30574380a3 [REFACTOR] Factor _fx_graph_cache_key and _time_taken_ns to common base class (#141878)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141878
Approved by: https://github.com/jamesjwu
ghstack dependencies: #141877
2024-12-02 20:07:12 +00:00
61534391ba [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-12-02 19:48:05 +00:00
fe68f61c59 Migrate micro benchmark results to benchmark database schema v3 (#141745)
Similar to https://github.com/pytorch/pytorch/pull/141087, this uploads the micro benchmark results to benchmark database with its new schema v3. The data can then be queried.

~I'm testing with `inductor-micro-benchmark-x86` which should be sufficient because `inductor-micro-benchmark` is broken atm.  The CSV output stays for now until the dashboard is migrated to schema v3.~ https://github.com/pytorch/pytorch/issues/141747 has been resolved, so inductor-micro-benchmark should work now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141745
Approved by: https://github.com/yanboliang
2024-12-02 19:45:51 +00:00
cyy
ab5467897a Fix NOLINTNEXTLINE (#141794)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-02 19:22:00 +00:00
161a2340ee Switch to using Python nested int (#141166)
Doesn't seem to noticeably slow down eager - TestNestedTensorSubclass tests with and without the PR finished in similar amounts of time (around 57s, 58s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141166
Approved by: https://github.com/ezyang
2024-12-02 19:17:30 +00:00
2d708752f0 [dynamo] Remove AutoDerefLocalSource and simplify cell handling (#141629)
This patch
1. removes `AutoDerefLocalSource` in favor of `LocalSource`, thereby
   removing its special handling in guards.
2. introduces a `LocalCellSource` for cells from the root frame, with
   only `reconstruct` implemented, to programmatically enforce that these
   cells should never be used by other components like guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141629
Approved by: https://github.com/jansel
ghstack dependencies: #141628
2024-12-02 19:09:30 +00:00
e14d8c980f [dynamo][NFC] Rename NewCellVariable to CellVariable (#141628)
It was named `NewCellVariable` because we originally used it to
represent cells created by the code Dynamo is tracing through. However, now we
use it to represent pre-existing cells as well, so this patch renames it
to avoid confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141628
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-02 19:09:30 +00:00
0989871ac9 pytorch/feature: Record if parallel compile is enabled (#141074)
This gets a bit messy, but this appears to be the best spot to make a
true / false decision.

Note that since we're looking at whether or not it's used, if the pool
doesn't warm up within the time it takes for a compile, we will mark the
feature use as false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141074
Approved by: https://github.com/masnesral
ghstack dependencies: #141059
2024-12-02 19:09:11 +00:00
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
9012e7a62f Revert "[dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)"
This reverts commit 07850bb2c1771ba3f5578b0aa85792e5cd70de1c.

Reverted https://github.com/pytorch/pytorch/pull/137397 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/137397#issuecomment-2511934283))
2024-12-02 16:05:14 +00:00
eb7deb2db5 Revert "Fix NOLINTNEXTLINE (#141794)"
This reverts commit 7dd9b5fc4343d101294dbbab4b4172f2859460bc.

Reverted https://github.com/pytorch/pytorch/pull/141794 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/12087979418/job/33711943084) [HUD commit link](7dd9b5fc43) ([comment](https://github.com/pytorch/pytorch/pull/141794#issuecomment-2511789484))
2024-12-02 15:07:50 +00:00
a34a56f69f Revert "Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)"
This reverts commit 795f28ac552eb61d02ea02fd64637ba814133bd8.

Reverted https://github.com/pytorch/pytorch/pull/141625 on behalf of https://github.com/albanD due to Broken main ([comment](https://github.com/pytorch/pytorch/pull/141625#issuecomment-2511639687))
2024-12-02 14:10:38 +00:00
ec96597e47 Revert "ILP for auto FSDP wrapping (#140298)"
This reverts commit d4cdc098817a0af10b478256b524533ed67285a9.

Reverted https://github.com/pytorch/pytorch/pull/140298 on behalf of https://github.com/xuanzhang816 due to for other PR ([comment](https://github.com/pytorch/pytorch/pull/140298#issuecomment-2511638743))
2024-12-02 14:08:04 +00:00
942a2438e2 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-12-02 08:42:10 +00:00
96d2a511ce [Inductor][CPP] Fix issue in CPP GEMM Template Prune Tensor (#141798)
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we will verify if any node in the current graph shares the same storage as the node we intend to prune. In the implementation, we assumed that when creating the `GraphLowering` in post-grad phase, there would be no `submodules`, and all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when enabling `FlexAttention`. In this scenario, `submodules` are present as `get_attr` node in post-grad phase. For example:

```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]     class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]         def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]             return arg0_1

V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,));  add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0];  flex_attention_30 = None
```
We added an extra check in the implementation to ensure we only compare `get_attr` nodes that correspond to a `torch.Tensor`. It is difficult to reproduce this issue using pure higher-order operators; adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.
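
A hedged sketch of the extra guard (names assumed, dotted attribute targets elided), skipping `get_attr` nodes that resolve to submodules rather than tensors:

```python
import torch

def storage_check_candidates(gm: torch.fx.GraphModule):
    for node in gm.graph.nodes:
        if node.op != "get_attr":
            continue
        attr = getattr(gm, node.target, None)
        if not isinstance(attr, torch.Tensor):
            # e.g. FlexAttention score/mask submodules appear as get_attr in post-grad
            continue
        yield node, attr  # only real tensors participate in the shared-storage check
```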

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
2024-12-02 07:38:57 +00:00
90f4d60672 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit daed864f7b3ca3b3e64ed13624369fd3007ad47d.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/xuhancn due to need to fix on XPU. ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2510737212))
2024-12-02 07:10:41 +00:00
cyy
8cada5cbe5 Use std::apply (#141834)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141834
Approved by: https://github.com/Skylion007
2024-12-02 05:49:10 +00:00
f16e08042c [user triton] Fix grid codegen for configs with empty kwargs (#141824)
Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the cases when there are no kwargs in the config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824
Approved by: https://github.com/Chillee
2024-12-02 04:17:21 +00:00
daed864f7b export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-02 03:20:29 +00:00
81ab2cc757 Update torch-xpu-ops commit pin (#141201)
Update the torch-xpu-ops commit to [1e32bbc](1e32bbc3d9), which includes:

- Improve XPU aten operator coverage
- Support basic `SparseXPU` operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141201
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-02 01:49:07 +00:00
795f28ac55 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-02 00:35:29 +00:00
8eb259fdc3 Added option to control number of kernel options displayed (#138788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138788
Approved by: https://github.com/drisspg
2024-12-02 00:35:29 +00:00
fc74ec4989 [2/N] Avoid copy in std::get (#141826)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141826
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-02 00:16:48 +00:00
b2fe1b9409 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-12-01 19:47:41 +00:00
90f19fee8a [MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d (#141780)
When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op will generate incorrect output (see example image in #141471). This PR checks if the op is 3D, and then attempts to convert the input tensor to contiguous.

Added a regression test that verifies the output by running the same op on the CPU.
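
A minimal check in the spirit of that regression test (shapes and tolerances assumed):

```python
import torch

x = torch.randn(1, 3, 8, 16, 16).to(memory_format=torch.channels_last_3d)
conv = torch.nn.Conv3d(3, 4, kernel_size=3)

ref = conv(x)                                # CPU reference
out = conv.to("mps")(x.to("mps")).cpu()      # MPS result with a channels_last_3d input
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```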

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?

Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
2024-12-01 18:36:53 +00:00
5deca07c0d [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-01 09:54:34 +00:00
cyy
96be048f06 [1/N] Avoid copy in std::get (#141812)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141812
Approved by: https://github.com/Skylion007
2024-12-01 03:53:35 +00:00
c2fa544472 [Inductor] move block pointer analysis to a new module (#141733)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing.

# Test plan

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733
Approved by: https://github.com/jansel
2024-11-30 23:21:24 +00:00
49fde426ba [Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
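
A minimal sketch of what such a helper could look like (name and exact check assumed), which also covers multi-dimensional prefixes like `r0_`:

```python
def prefix_is_reduction(prefix: str) -> bool:
    # "r", "r0_", "r1_", ... are all reduction prefixes
    return prefix.startswith("r")
```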

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
2024-11-30 22:38:13 +00:00
394c339691 improve typings in unflatten (#141817)
A first follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `unflatten.py` which had been inaccurate due to the previously suppressed type checking on `torch.nn.Module`.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141817
Approved by: https://github.com/Skylion007
2024-11-30 22:12:15 +00:00
8a81f7a4b6 Refactor functions in functorch for functional (#141808)
As the title stated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141808
Approved by: https://github.com/Skylion007
2024-11-30 20:15:40 +00:00
0f3f801fc2 Add windows CUDA 12.6 nightly builds (#141805)
Windows AMI was published to prod. This PR adds CUDA 12.6 nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141805
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-30 14:39:47 +00:00
eqy
9532589b53 [CUDA][64-bit indexing] Support 64-bit indexing in distribution_elementwise_grid_stride_kernel (#141613)
For #141544
Overhead doesn't seem to be noticeable even on small sizes (e.g., 2**10 elements)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141613
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2024-11-30 06:55:02 +00:00
7fafaa9c82 Introduce CompiledAOTI (#141695)
Stacked on https://github.com/pytorch/pytorch/pull/141691

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141695
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683, #141685, #141688, #141689, #141691
2024-11-30 00:05:41 +00:00
2f72635a5c automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 22:36:53 +00:00
cyy
e29dabbd71 Fix performance-unnecessary-copy-initialization (#141792)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141792
Approved by: https://github.com/Skylion007
2024-11-29 22:10:06 +00:00
a23ac6f8bd [CD] Enable pypi dependencies both for XPU linux and Windows whls (#141135)
Enable xpu runtime pypi packages as dependencies of XPU CD wheels both for Linux and Windows.
Fixes https://github.com/pytorch/pytorch/issues/135867
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141135
Approved by: https://github.com/atalman
2024-11-29 21:35:07 +00:00
44707b0667 Pass rounding_mode for div reference inputs through kwargs (#136308)
Previously, the reference inputs for div with rounding mode did not supply the rounding_mode keyword argument. This didn't match the sample inputs for this op.
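
For reference, a small example of how the `rounding_mode` keyword changes `torch.div` results:

```python
import torch

a = torch.tensor([7.0, -7.0])
b = torch.tensor([2.0, 2.0])

torch.div(a, b)                         # tensor([ 3.5000, -3.5000])
torch.div(a, b, rounding_mode="trunc")  # tensor([ 3., -3.])
torch.div(a, b, rounding_mode="floor")  # tensor([ 3., -4.])
```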

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136308
Approved by: https://github.com/albanD

Co-authored-by: Xia, Weiwen <weiwen.xia@intel.com>
Co-authored-by: Bob Ren <bobren@meta.com>
Co-authored-by: Xilun Wu <12968408+XilunWu@users.noreply.github.com>
Co-authored-by: siahuat0727 <tansiahuat@gmail.com>
2024-11-29 21:28:24 +00:00
ed092e2161 [2/N] Rename NCCLTraceBuffer to FlightRecorder (#141712)
Just name change. No behavior change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141712
Approved by: https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #141648
2024-11-29 21:15:31 +00:00
a8a570512b [export] Generate compatible thrift schema out of schema.py (#141611)
Summary: To make sure schema.py and schema.thrift are kept in sync, we use the int keys from thrift and use Python Annotated type to associate fields between thrift and schema.py. Later we will use this association to build a single source of truth between the schemas.

Test Plan: CI

Differential Revision: D66253157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141611
Approved by: https://github.com/yiming0416
2024-11-29 20:09:49 +00:00
7dd9b5fc43 Fix NOLINTNEXTLINE (#141794)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-29 16:23:59 +00:00
9e98b3d73c Revert "automatic dynamic unspecialize float (#141647)"
This reverts commit 1a32daeb17cd56601c60cb4000a4ef75120af37f.

Reverted https://github.com/pytorch/pytorch/pull/141647 on behalf of https://github.com/atalman due to functorch/test_aotdispatch.py::TestAOTAutogradWithCache::test_inner_grad [GH job link](https://github.com/pytorch/pytorch/actions/runs/12080983316/job/33697901875) [HUD commit link](1a32daeb17) ([comment](https://github.com/pytorch/pytorch/pull/141647#issuecomment-2507980876))
2024-11-29 15:00:33 +00:00
3c63e76b03 [PT2E Quantization] Fix RecursionError when prepare_pt2e graph with concat of the same node (#141651)
Fixes #129038

Related PR #129567

Here is the new PR against main, thanks! @jerryzh168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141651
Approved by: https://github.com/jerryzh168
2024-11-29 09:19:22 +00:00
ce572fedfc [dtensor][random] use torch.uint64 as the seed/offset tensor dtype to avoid overflow (#141532)
**Summary**
DTensor RNG code raises an error if the seed passed in is beyond the `torch.int64` range (e.g. `torch.tensor([2**64-1])` raises an error). The solution is to specify `dtype=torch.uint64` in the `torch.tensor()` call.
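
A minimal illustration of the overflow and the fix described above:

```python
import torch

seed = 2**64 - 1
# torch.tensor([seed]) fails: the value is out of range for the default int64 dtype
seed_tensor = torch.tensor([seed], dtype=torch.uint64)  # explicit uint64 avoids the overflow
```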

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141532
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220, #141223
2024-11-29 07:59:34 +00:00
93cbb287c2 [dtensor][random] allow user to manual_seed different seed on device mesh; only sync RNG state in WORLD when manual_seed has not been called (#141223)
**Summary**
This PR proposes 4 changes to DTensor RNG management:
1. DTensor allows users to eagerly initialize the RNG tracker by calling `torch.distributed.tensor._random.manual_seed`.
2. DTensor `manual_seed` no longer checks the integrity of the `seed` argument. Users are responsible for setting the same seed on all ranks within an SPMD group, but if there are multiple separate SPMD groups (e.g. across pipeline stages), users should set a _different_ seed for each SPMD group. For cases like Pipeline Parallel, users can set different initial seed for pipelining stages by calling
```
world_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(2, 2, 2),
    mesh_dim_names=("pp", "dp", "tp"),
)
pp_mesh = world_mesh["pp"]
pp_rank = pp_mesh.get_local_rank()
spmd_mesh = world_mesh["dp", "tp"]._flatten("spmd")  # this flattening is only needed if you need to call collective over this mesh
torch.distributed.tensor._random.manual_seed(123+pp_rank, spmd_mesh)
```

In other words, if users want to call `torch.distributed.tensor._random.manual_seed`, they will be responsible for passing in the right value, and DTensor won't perform any checks on it. If the current rank is not a part of the mesh, it will use the current device RNG state to initialize.

3. `OffsetBasedRNGTracker` still performs RNG state synchronization by broadcasting the RNG state on rank 0 to `WORLD`. However, calling `torch.distributed.tensor._random.manual_seed` is an exception. In this case, no broadcast will happen.

4. Enforce that the `manual_seed` call only accept "full mesh" i.e. the DTensor RNG state on every rank must be set through the call. This makes sure that no rank has its RNG state left uninitialized and the SPMD ranks have their RNG state synchronous.

**Motivation**
tl;dr

1. Lazily initializing DTensor RNG tracker causes hang in non-SPMD code such as Pipeline Parallel.
2. Users may want to set different seed on ranks in one device mesh.
3. We want to keep the old behavior if users prefer not curating the RNG state and want to have DTensor take care of it.

see detail in https://github.com/pytorch/pytorch/issues/140301

**Test**
`pytest test/distributed/_tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141223
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220
2024-11-29 07:59:34 +00:00
7f5bc9dd87 [dtensor][random][tp] remove the adhoc DTensor RNG tracker TensorParallelRNGTracker since it does not match FSDP2+TP (#141220)
**Summary**
The ad-hoc DTensor RNG tracker was used to mimic Megatron DDP+TP RNG behavior, but it turns out to be incompatible with PyTorch Distributed FSDP2+TP, so we decided to deprecate it and replace it with `OffsetBasedRNGTracker`, which follows the SPMD semantics (replicas get the same random sampling result, shards get different results).

**Motivation**
`TensorParallelRNGTracker` was designed for DDP+TP where the random operators produce the same result along the data parallel mesh dimension and different results along the tensor parallel dimension. However this does not apply to the new FSDP+TP composable combination where the model weights are sharded along data parallel mesh dimension as well. Therefore we decide to remove this outdated RNG tracker type for now. If users have demands for exact match between PyTorch Distributed and Megatron on Random Number generation result, feel free to file an issue.

**Impact**
`TensorParallelRNGTracker` was only used when Tensor Parallel is used (i.e. calling `parallelize_module`).

For non-FSDP users, the "replicas get the same random numbers and shards get different ones" remains unchanged. Unlike `TensorParallelRNGTracker` which sets different seeds (`base_seed + 2718 + TP_rank`) within the TP group, DTensor now sets the same seed (default value is 1234 but users can call `torch.distributed.tensor._random.manual_seed` to modify) on all ranks but choose the right RNG offset based on DTensor placements to enforce the "replicas get the same random numbers and shards get different ones" invariant.

For FSDP2 users, improvement should be observed in a way that DTensor sharded within DP group now gets different random number sampling which `TensorParallelRNGTracker` failed to do, though we're not sure how much this change will improve the eventual training loss convergence.

**Test**
1-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_tp_model_meta_init`

2-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_fsdp_tp_model_meta_init`

TP model weight init test:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

FSDP+TP model weight init test:
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141220
Approved by: https://github.com/wconstab
ghstack dependencies: #141731
2024-11-29 07:59:26 +00:00
c55191f3a2 [dtensor][random] add 1d and 2d model meta init tests (#141731)
**Summary**
Added tests for model meta init on 1-d mesh (TP) and 2-d mesh (FSDP+TP). This exercises the issue where DTensor RNG failed to initialize weights differently across FSDP ranks.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141731
Approved by: https://github.com/wconstab
2024-11-29 07:59:20 +00:00
1a32daeb17 automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 07:53:53 +00:00
9827d677b4 [Quant][PT2E][X86] annotate and convert for linear_dynamic_fp16 (#141480)
Annotate linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`
After `convert_pt2e`, the pattern will be
```
  x
  |
linear <- to_fp32 <- to_fp16 <- w
```

**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141480
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-11-29 07:48:39 +00:00
b7a45dbae3 Add monitor script (#141438)
# Overview
Add monitor script to collect system-level utilization data during CI tests.
Currently all monitoring scripts are disabled.

# Details
- Add flag to customize the time intervals for logging
- Enable multiple GPU utilization logging

# Next step
Enable the monitor script in non-perf-test workflows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438
Approved by: https://github.com/huydhn
2024-11-29 04:14:31 +00:00
4d5c096a55 [MPS] Add autocast rule for SDPA (#141776)
Fixes #141774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141776
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-29 03:34:03 +00:00
b97a786125 Inline compile_to_fn at its only call site (#141691)
Stacked on https://github.com/pytorch/pytorch/pull/141689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141691
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688, #141689
2024-11-29 01:15:38 +00:00
9e4723cc6e Unify post_compile1 and CompiledFxGraph constructor (#141689)
Stacked on https://github.com/pytorch/pytorch/pull/141688

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141689
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688
2024-11-29 01:15:38 +00:00
29326b9d29 Hoist post_compile1 into fx_codegen_and_compile (#141688)
Stacked on top of https://github.com/pytorch/pytorch/pull/141685

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141688
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685
2024-11-29 01:15:31 +00:00
cf3daf723f Unify cache disable and cache bypass paths (#141685)
I was constantly annoyed at the fact that we had a separate else branch for when cache was disabled which was distinct from when cache was bypassed. This diff gets rid of the disabled cache branch, so we use the same logic for bypass/disable. I actually think this change probably didn't actually matter much for the POC but I think it's cleaner.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141685
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683
2024-11-29 01:15:24 +00:00
7224cd4471 [BE]: Update 12.6 builds to CUDA 12.6.3 (#141433)
Update CUDA 12.6 to Update 3 and make cusparse-lt 0.6.3 (#141365)? Was going to leave some comments on #141365, but thought it was just faster to open a PR here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141433
Approved by: https://github.com/atalman
2024-11-28 22:01:47 +00:00
ae6519cb74 [codemod] c10::string_view -> std::string_view in fields (#141736)
Summary: `c10::string_view` is being removed, so we need to migrate.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D65830276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141736
Approved by: https://github.com/Skylion007
2024-11-28 21:35:53 +00:00
09a3eddc07 Revert #141066 and #141494 (#141721)
manual revert due to merge conflicts

note: #141494 was reverted out of order blocking automatic revert of #141066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141721
Approved by: https://github.com/avikchaudhuri
2024-11-28 20:18:19 +00:00
d08bd6d627 Revert "Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)"
This reverts commit 8a3317cd41d0442d13090932ae5548e7b9fe45bd.

Reverted https://github.com/pytorch/pytorch/pull/141587 on behalf of https://github.com/atalman due to inductor/test_torchinductor_strided_blocks.py::TritonBlockPointerTestGPU::test_expand_broadcast_x_size0_y_size0_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/12072823884/job/33669367764) [HUD commit link](8a3317cd41) ([comment](https://github.com/pytorch/pytorch/pull/141587#issuecomment-2506690095))
2024-11-28 19:41:03 +00:00
907c31f529 [ROCm] devtoolset / GCC11 upgrade on manylinux images - 1b of 2 (docker images) (#141609)
Upgrade gcc version from 9 to 11 on ROCm manylinux images.

Needed for #141423 since almalinux8-based manylinux2_28 images for ROCm (#140681) installs gcc-toolset-9, which installs [gcc 9.2.1](https://pkgs.org/download/gcc-toolset-9-gcc-c++). However, PyTorch CMakeLists.txt enforces a [minimum gcc version of 9.3](5318bf8baf/CMakeLists.txt (L61)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141609
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2024-11-28 19:18:09 +00:00
f4187050fe [ONNX] Remove special handling of torchvision.ops imports in onnx export (#141569)
Fixes #141568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141569
Approved by: https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
2024-11-28 18:05:40 +00:00
6d204cb5ed Hoist set_feature_use out of conditional, rename some variables (#141683)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141683
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #141681
2024-11-28 17:43:11 +00:00
229daf7470 Inline FxGraphCache.load into its sole call site (#141681)
I need to restructure the body of FxGraphCache.load with the outer if-else in its call site, so inline it goes!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141681
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-11-28 17:43:11 +00:00
b9a8df4bdd [CD] Add triton xpu build back (#141775)
The Triton XPU build was stopped by https://github.com/pytorch/pytorch/pull/139206 temporarily, waiting for the Triton XPU upgrade PR https://github.com/pytorch/pytorch/pull/137886 to land.

Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141775
Approved by: https://github.com/atalman
2024-11-28 17:37:42 +00:00
cyy
6b430c26bd Fix bugprone-argument-comment (#141777)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141777
Approved by: https://github.com/Skylion007
2024-11-28 16:56:50 +00:00
8a3317cd41 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-11-28 16:45:25 +00:00
5aacfa037b [Inductor] fix broadcast logic for Triton (#141027) (#141693)
Summary:

Fix the logic for inserting a broadcast in kernels where a load goes directly to a store. In that case, we insert a tl.broadcast on the store regardless of the block size on the load; when a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693
Approved by: https://github.com/blaine-rister
2024-11-28 16:38:25 +00:00
f684dbd002 Try to simplify FloorDiv axioms implications when needed during evaluations. (#141267)
Summary:
This is very much the same solution proposed by bobrenjc93, except that it restricts the simplification to expressions and axioms that contain FloorDiv, since those are the only ones that could have become CleanDiv and the only ones that can change as the shape env changes.

This also does not break the torchrec benchmarks. It might be worth knowing why the generalization of this approach does break them, but we could just be hitting another bug or NYI situation.

Overhead? None on
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```

Differential Revision: D66307433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141267
Approved by: https://github.com/ezyang
2024-11-28 15:35:35 +00:00
d49f0bf466 [CI] Fix xpu linux ci build environment duplicated issue (#141546)
We found that there are duplicated build environments in the XPU Linux CI tests, which led to test jobs possibly downloading the wrong PyTorch build artifact file. Refer to https://github.com/pytorch/pytorch/actions/runs/12023238798/job/33518351906#step:14:633

Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141546
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-11-28 14:21:21 +00:00
0f261e8f77 Add Manylinux2014 and Manylinux 2.28 config to triton builds. Run auditwheel on triton binaries (#141704)
This PR combines Manylinux 2_28 and Manylinux 2014  builds of triton under one workflow. This is required in order to support torch cpu, cuda 118, cuda 12.4 wheels built with Manylinux 2014 and torch cuda 12.6 wheels built with Manylinux 2_28.

Manylinux 2014 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl``
Manylinux 2_28 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141704
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
2024-11-28 13:40:39 +00:00
f83361b274 inductor dtype propagation fixes (#141495)
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are sort of inconsistent in dtype and don't always respect the dtype specified. It would be nice to fix, but not doing it in this PR.
- Bug fix in view dtype where we were always upcasting back to fp32 when the input was in bf16/fp16; we should only be doing that if the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.

Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.

Follow ups:

- We could consider requiring fewer explicit upcast_compute_types calls and do it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it in this PR.
- Be more consistent on our index expr dtype printing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
2024-11-28 11:39:38 +00:00
1ef1b3b391 Add missing data types at torch export serialization (#138561)
Related to #131654

Added missing FP8 data types at torch export serialization.
Added test cases of FP8 data types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138561
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-11-28 08:35:03 +00:00
5212ec3879 Add admonition about as_float_unchecked() (#141742)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141742
Approved by: https://github.com/bdhirsh
2024-11-28 06:25:18 +00:00
d905f1350a Friendly catch exception when fail to initialize XPU devices (#141658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141658
Approved by: https://github.com/EikanWang
2024-11-28 05:17:08 +00:00
60fe50aa42 Move post compile steps into post_compile1/post_compile2 method (#141656)
The intention for turning these into methods is so that the AOTInductor compile path can implement them differently. I haven't worked out the implications yet though, but this seemed like a good stopping point for now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141656
Approved by: https://github.com/aorenste, https://github.com/jamesjwu, https://github.com/jansel
2024-11-28 04:45:40 +00:00
9f48881ba8 [BE]: Enable RUF013 ban implicit optional (#141706)
Enables RUF013 rule to ban implicit Optional (from areas not already checked by mypy).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141706
Approved by: https://github.com/ezyang
2024-11-28 04:03:01 +00:00
b33f770574 Revert "[inductor] Fix 3d tiling (#141709)"
This reverts commit ca9bfa1a384ed6871d4b1874bae81e72c747fd11.

Reverted https://github.com/pytorch/pytorch/pull/141709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is one failed test showing up in trunk.  It was missed by target determination ([comment](https://github.com/pytorch/pytorch/pull/141709#issuecomment-2505213481))
2024-11-28 03:55:31 +00:00
3becdaf8a7 [c10] Fix static_assert for 32-bit systems (#141244)
The `__ANDROID__` macro was used as a proxy to check whether compilation is targeting a 32- or 64-bit system, causing build failures on non-Android 32-bit Linux targets like ARMv7.

This modification adjusts the check to fail if and only if `int64_t` and `long` are not the same on 64-bit systems, i.e. systems where `sizeof(void*) == 8`.

Like I said in issue #141043, I'm not sure whether a different `Scalar` constructor should be defined in the 32-bit case. My code does not break, but I'm not sure other people's code won't.

Fixes #141043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141244
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-28 03:11:52 +00:00
54d26d670e [CP] Add assertion for unsupported load-balance + non-causal (#141622)
We actually do not support load-balance mode when non_causal = True, due
to changes in data shuffling for load_balance mode.  This PR just adds
an assertion to make this limitation clear.

Fixes #141429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141622
Approved by: https://github.com/XilunWu
2024-11-28 02:52:35 +00:00
b556549357 Use default context on Windows for Intel GPU (#138049)
# Motivation
Use the default context on Windows to keep consistency with Linux. It makes it easy to interact with external libraries like `dlpack`.

# Additional Context
This PR depends on Intel GPU oneAPI 2025.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138049
Approved by: https://github.com/gujinghui
2024-11-28 02:49:46 +00:00
a8482ab3a8 [Reland] Enable XPUEvent elapsed_time function (#140873)
# Motivation
This PR intends to reland https://github.com/pytorch/pytorch/pull/134666 that has been reverted in https://github.com/pytorch/pytorch/pull/140872
We reverted it because I forgot to support `elapsed_time` for `XPUGuardImpl`, which resulted in `c10::Event` not supporting `elapsed_time` and blocked XPU CI.

# Additional Context
We split https://github.com/pytorch/pytorch/pull/134666 into two parts: one part, PR #140865, supports `elapsed_time` for `torch.Event`, and the other, this PR, supports `torch.xpu.elapsed_time`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140873
Approved by: https://github.com/gujinghui
ghstack dependencies: #140865
2024-11-28 02:41:11 +00:00
b1a8be6b0a Support torch.Event elapsed_time method on XPU (#140865)
# Motivation
This PR aims to support c10::Event/torch.Event elapsed_time method on XPU. We create a profiling tag Event when the timing flag is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140865
Approved by: https://github.com/Samkm0084, https://github.com/gujinghui
2024-11-28 02:41:11 +00:00
d70b7029c8 [MTIA] Support torch.mtia.empty_cache() (#141533)
Summary: As title

Test Plan:
Passed a local unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testrun/4785074861101240

Reviewed By: nautsimon

Differential Revision: D66481778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141533
Approved by: https://github.com/nautsimon
2024-11-28 02:24:19 +00:00
f35bb55256 Update triton xpu commit pin (#137886)
# Motivation
Due to the code change of https://github.com/pytorch/pytorch/pull/135567, triton-xpu needs to fetch `tensor.data_ptr()` via `uint64` instead of `int64`, refer to https://github.com/intel/intel-xpu-backend-for-triton/pull/2192

# Additional Context
triton commit comes from release branch: https://github.com/intel/intel-xpu-backend-for-triton/tree/release/3.2.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137886
Approved by: https://github.com/EikanWang, https://github.com/atalman
ghstack dependencies: #135567
2024-11-28 02:01:52 +00:00
ac0b0d11ab [Reland] Fix tensor.data_ptr() representation overflow (#135567)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135550
In PyTorch, [`tensor.data_ptr()`](e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)) is reinterpreted by a [signed int64](e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)) data type, which could result in an **overflow issue**, like below:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
-23453392437248
# this is inconsistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```
This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)). With this PR, the output will become:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
18446720620317114368
# this is consistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```

# Solution
Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantic of `wrap`.

# Additional Context
This PR was previously reverted (revert commit 2e8d431a8f; it is relanded here in place with no further changes) because the change to `tensor.data_ptr()` needs to be synced with the Intel XPU Triton side, see [#2192](https://github.com/intel/intel-xpu-backend-for-triton/pull/2192). So we have to update the XPU Triton commit pin together with this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD
2024-11-28 02:01:52 +00:00
cyy
5ca75ac1df Enable UBSAN tests (#141672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141672
Approved by: https://github.com/ezyang
2024-11-28 01:55:15 +00:00
ca9bfa1a38 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-11-28 01:34:28 +00:00
ad3986498a [Partitioner] Speed up the update of partition map (#136616)
We can update the partition map by iterating over the direct users of a node rather than all of its downstream users. The former is faster than the latter, which performs many duplicate insertions.
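
A toy sketch of the idea; the `Node` class and helper below are hypothetical stand-ins, not the actual partitioner code:

```python
# Illustrative only: updating the partition map from a node's direct users
# visits each edge once, whereas walking all downstream users revisits the
# same nodes repeatedly and performs duplicate insertions.
class Node:
    def __init__(self, name, users=()):
        self.name, self.users = name, list(users)

def update_partition_map(node, assignment, partition_id):
    for user in node.users:  # direct users only
        assignment.setdefault(user.name, set()).add(partition_id)

c = Node("c"); b = Node("b", [c]); a = Node("a", [b, c])
assignment = {}
update_partition_map(a, assignment, partition_id=0)
print(assignment)  # {'b': {0}, 'c': {0}}
```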
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136616
Approved by: https://github.com/jgong5, https://github.com/tarun292
2024-11-28 01:11:44 +00:00
cyy
45ed7c13fa Remove unneeded std::make_optional (#141567)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141567
Approved by: https://github.com/albanD
2024-11-28 00:05:21 +00:00
fea771dcce Revert "Install magma from a tarball (#140417)"
This reverts commit 30ab10247d07d1682388313c3982e05dd73a055c.

Reverted https://github.com/pytorch/pytorch/pull/140417 on behalf of https://github.com/atalman due to Caused failures in calculate docker image ([comment](https://github.com/pytorch/pytorch/pull/140417#issuecomment-2504968996))
2024-11-27 23:22:43 +00:00
e24190709f [BE] Remove Model Dump utility (#141540)
So I found this utility by accident, trying to find how many html files we have in the repo so I could convert them to markdown

Turns out we package some html and js files in pytorch to visualize torchscript models. This seems kinda strange and probably shouldn't be in core, so I removed the tests I could find. Maybe some internal tests will break, but considering torchscript is being superseded, it might make sense to do this anyway.

The last meaningful update to the test for this file was about 2 years ago by @digantdesai; since then it's been a bunch of routine upgrades.

It seems like this package is unused: https://github.com/search?type=code&auto_enroll=true&q=torch.utils.model_dump&p=1 I skimmed through 5 pages of results and the only time this shows up in code search is when someone is either cloning pytorch or checking their venv into github.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141540
Approved by: https://github.com/malfet
2024-11-27 22:52:55 +00:00
533798ef46 [dynamo] Enforce some invariants on ConstantVariable.create (#140984)
This addresses https://github.com/pytorch/pytorch/pull/140745#issuecomment-2480854259.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140984
Approved by: https://github.com/jansel
ghstack dependencies: #141504
2024-11-27 21:58:35 +00:00
3141e038f0 [dynamo] Fix VariableBuilder._wrap on frozenset and enforce invariants on ConstantVariable (#141504)
Prior to this patch, we were using `ConstantVariable.create` to create a VT
for frozenset objects, and intended (yet failed) to predicate that on all
items being literals (see https://github.com/pytorch/pytorch/pull/140984#discussion_r1847393736).

The code was from https://github.com/pytorch/torchdynamo/commit/7c03434 and
the original goal was to help DBR quantization, but as the new test in
this patch shows, it could lead to silent incorrectness.

Upon a closer look, this exposes some subtleties in how Dynamo handles
`ConstantVariable` and `LOAD_CONST`, so this patch both fixes the
aforementioned issue and documents, enforces, and makes explicit the
invariants around `ConstantVariable` and `LOAD_CONST` -- only immutable
objects are supported.

Specifically, this patch:
1. Refines the checks for wrapping a `frozenset` object, documents why we
   can't just wrap its items directly due to the lack of a `Source` for set
   items, and uses a safe workaround (`SourcelessBuilder`) to ensure
   soundness while keeping the DBR quantization support.
2. Adds more types to `common_constant_types`, thereby making
   `ConstantVariable.is_base_literal` more lenient, and strictly checks
   this property in the constructor of `ConstantVariable`.
3. Changes relevant uses of `create_instruction("LOAD_CONST", ...)` to
   `create_load_const`, which checks `is_safe_constant`, and makes
   developer overrides explicit by using `create_load_const_unchecked`
   when needed.
4. In a few places, uses more specific `VariableTracker`s, e.g.,
   `TypingVariable` rather than `ConstantVariable`, and
   `FrozensetVariable` rather than `SetVariable`.

(2) and (3) are mainly to future-proof Dynamo against bugs like (1).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141504
Approved by: https://github.com/jansel
2024-11-27 21:58:35 +00:00
a962ae511d Extend gpt-fast LLM dashboard to support torchao autoquant (#140627)
Summary:
We want to test autoquant on relevant LLM models

right now only llama2 and mixtral, but want to extend to more models like https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models

Test Plan:

```
                         Llama-2-7b-chat-hf   Mixtral-8x7B-v0.1
gpt-fast int8                        112.98              147.92
torchao autoquant                     87.41               85.90
torchao autoquantv2                  131.12               79.59
```

https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch

in pytorch/benchmarks/gpt_fast
```
python benchmark.py
```

output:
```
Loading model Llama-2-7b-chat-hf
Using int8 weight-only quantization!
Time to load model: 2.80 seconds
Compilation time: 170.24 seconds
Average tokens/sec: 112.98 tokens/sec
Average bandwidth achieved: 746.86 GB/s
Memory used: 7.95 GB

Loading model Mixtral-8x7B-v0.1
Using int8 weight-only quantization!
Time to load model: 0.24 seconds
Compilation time: 181.81 seconds
Average tokens/sec: 147.92 tokens/sec
Average bandwidth achieved: 953.06 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Using autoquant
Compilation time: 109.31 seconds
Average tokens/sec: 87.17 tokens/sec
Average bandwidth achieved: 1151.86 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Compilation time: 48.08 seconds
Average tokens/sec: 87.41 tokens/sec
Average bandwidth achieved: 1155.05 GB/s
Memory used: 36.86 GB

Loading model Mixtral-8x7B-v0.1
Time to load model: 0.20 seconds
Using autoquant
Compilation time: 47.32 seconds
Average tokens/sec: 85.90 tokens/sec
Average bandwidth achieved: 1106.37 GB/s
Memory used: 66.81 GB

local test (autoquant v2):
Loading model Mixtral-8x7B-v0.1
Compilation time: 124.40 seconds
Average tokens/sec: 90.41 tokens/sec
Average bandwidth achieved: 1164.47 GB/s
Memory used: 53.91 GB

Loading model Llama-2-7b-chat-hf
TODO

```

gpt_fast_benchmark.csv:
```
name,metric,target,actual,dtype,device,arch,is_model
Llama-2-7b-chat-hf,token_per_sec,144,112.98,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,746.86,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,170.24,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,147.92,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,953.06,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,181.81,int8,cuda,NVIDIA PG509-210,True
gemv,memory_bandwidth(GB/s),870,867.06,int8,cuda,NVIDIA PG509-210,False
gemv,memory_bandwidth(GB/s),990,1092.43,bfloat16,cuda,NVIDIA PG509-210,False
layer_norm,memory_bandwidth(GB/s),950,573.57,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,144,87.17,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,1151.86,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,109.31,autoquant,cuda,NVIDIA PG509-210,True
gather_gemv,memory_bandwidth(GB/s),990,945.38,int8,cuda,NVIDIA PG509-210,False
gather_gemv,memory_bandwidth(GB/s),1060,1188.29,bfloat16,cuda,NVIDIA PG509-210,False
mlp_layer_norm_gelu,flops_utilization,0.8,0.82,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,94,87.41,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,1155.05,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),133,48.08,bfloat16,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,85.90,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,1106.37,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,47.32,autoquant,cuda,NVIDIA PG509-210,True
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140627
Approved by: https://github.com/huydhn
2024-11-27 21:57:48 +00:00
30ab10247d Install magma from a tarball (#140417)
Magma is built for specific CUDA versions and stored in the ossci-linux bucket. Install it from there rather than the deprecated conda package.

There are two places where magma is installed today:
- `install_conda.sh`: extract the magma package in the same exact location where conda would install it, using a dedicated `install_magma_conda.sh` script. The new script is included in the relevant Dockerfiles where CUDA+magma is needed
- `install_magma.sh`: this script already uses a tarball. Use the new tarball instead of the tarball from the conda package. The format of the new tarball is compatible with the old one, so changes here are minimal.

Fixes #140538
Test PR: https://github.com/pytorch/pytorch/pull/141584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140417
Approved by: https://github.com/atalman
2024-11-27 21:56:20 +00:00
02b52572db Lint: switch oncall owner for test_transformers (#141722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141722
Approved by: https://github.com/malfet
2024-11-27 21:45:43 +00:00
5f004f455a [Dynamo][Distributed] Fix ProcessGroup getattr (#141638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141638
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-11-27 21:42:33 +00:00
dbbebee9d7 Code motion CompiledFxGraph to a dedicated file (#141654)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141654
Approved by: https://github.com/aorenste, https://github.com/jansel
ghstack dependencies: #141491, #141492, #141574
2024-11-27 20:42:21 +00:00
a7ca6a9113 Enable autograd cache on inductor tests (#140890)
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.

I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.

I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I changed a try/except to catch Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
2024-11-27 20:41:43 +00:00
ab63b679e9 Save indexing for getitem nodes when do custom replacements (#140193)
Fixes #137280

When we have multiple indexings for the same array as returned items in pattern replacement, we shouldn't ignore their indexing numbers. Otherwise, we may create a wrong pattern_to_node mapping.

A unit test is added in this PR. In this unit test, the function `rms_pattern_static` is replaced with `rms_replacement_static` when called. The function `rms_pattern_static` calls two functionalized custom operators, `torch.ops.vllm.rms_norm.default` and `torch.ops.vllm.static_scaled_int8_quant.default`, and it returns at2[1] and at2[2] as outputs. The function `rms_replacement_static` calls one functionalized custom operator `torch.ops.vllm.fused_rms_norm_quant_static.default`, which returns two corresponding items.

Run `python test/inductor/test_pattern_matcher.py -k test_multioutput_register_replacement` to test. After set `TORCH_COMPILE_DEBUG` to 1, the final part of the `fx_graph_readable.py` is like the following.
```python
# File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1673 in rms_pattern_static, code: at1 = auto_functionalized(
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.vllm.rms_norm.default, result = permute_1, input = convert_element_type, weight = convert_element_type_1, epsilon = 1e-06);  permute_1 = convert_element_type = convert_element_type_1 = None
getitem_1: "bf16[5, 4]" = auto_functionalized[1];  auto_functionalized = None

# File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1680 in rms_pattern_static, code: at2 = auto_functionalized(
auto_functionalized_1 = torch.ops.higher_order.auto_functionalized(torch.ops.vllm.static_scaled_int8_quant.default, result = permute, input = getitem_1, scale = full_default, azp = None);  permute = getitem_1 = full_default = None
getitem_3: "i8[5, 4]" = auto_functionalized_1[1]
getitem_4: "f32[1, 1]" = auto_functionalized_1[2];  auto_functionalized_1 = None
return (getitem_3, getitem_4)
```
This happens before pattern matching, so it is expected to call `static_scaled_int8_quant` and `rms_norm` and return `auto_functionalized_1` as the outputs.

However, for pytorch before this PR, the `fx_graph_transformed.py`, which is after pattern matching, has the following code.
```python
 # File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1748 in my_func_static, code: scale = torch.ones((1, 1))
full_default: "f32[1, 1]" = torch.ops.aten.full.default([1, 1], 1, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False)

# No stacktrace found for following nodes
as_strided_default: "i8[20]" = torch.ops.aten.as_strided.default(permute, [20], [1], 0)
clone_default: "i8[20]" = torch.ops.aten.clone.default(as_strided_default);  as_strided_default = None
as_strided_default_1: "i8[5, 4]" = torch.ops.aten.as_strided.default(clone_default, [5, 4], [4, 1], 0);  clone_default = None
as_strided_default_2: "f32[1]" = torch.ops.aten.as_strided.default(full_default, [1], [1], 0)
clone_default_1: "f32[1]" = torch.ops.aten.clone.default(as_strided_default_2);  as_strided_default_2 = None
as_strided_default_3: "f32[1, 1]" = torch.ops.aten.as_strided.default(clone_default_1, [1, 1], [1, 1], 0);  clone_default_1 = None
static_scaled_int8_quant_default = torch.ops.vllm.static_scaled_int8_quant.default(as_strided_default_1, permute_1, as_strided_default_3);  as_strided_default_1 = permute_1 = static_scaled_int8_quant_default = None
fused_rms_norm_quant_static_default = torch.ops.vllm.fused_rms_norm_quant_static.default(permute, convert_element_type, convert_element_type_1, full_default, None, 1e-06);  convert_element_type = convert_element_type_1 = full_default = fused_rms_norm_quant_static_default = None
return (permute, as_strided_default_3)
```
Here, it returns `(permute, as_strided_default_3)` while `permute` is written by fused_rms_norm_quant_static and `as_strided_default_3` is written by `static_scaled_int8_quant`. This is wrong because in our expectation, the `static_scaled_int8_quant` should be removed since it is replaced with `fused_rms_norm_quant_static`. It is supposed to return `(permute, full_default)`.

The root cause is the following part. When we [generate patterns](5f4a21dc58/torch/_inductor/pattern_matcher.py (L1580)) from the traced fx graph and call the following function, the `int` type of the indexing numbers in the traced graph is ignored via `ignore_types`. So the final arguments of the patterns for those two output items look like `(CallFunction(auto_functionalized, XXX), *)`.

5f4a21dc58/torch/_inductor/pattern_matcher.py (L1839-L1847)

When we do pattern matching after we generated the patterns in the following part, `sorted(itertools.chain.from_iterable(nodes), reverse=True)` is `[getitem_4, getitem_3, getitem_1]`. The getitem_4 iteration is always a FailedMatch because we always use the first element to do the pattern match here (it fails on different match functions before and after this PR, but the reason is always the indexing-numbers issue): d4cdc09881/torch/_inductor/pattern_matcher.py (L848). However, when we do pattern matching for getitem_3, the child_match returns a match for getitem_3 again, because the `*` pattern can match anything. Then getitem_3's pattern matching returns `[getitem_3, getitem_3]` as the outputs, which is wrong.
d4cdc09881/torch/_inductor/pattern_matcher.py (L856)

d4cdc09881/torch/_inductor/pattern_matcher.py (L1750-L1774)

This PR doesn't ignore the `int` type when we generate patterns for getitem functions, because integer indexing numbers are important to them. Thus, the indexing information is kept in the patterns, ensuring correct matching. With this PR, the above `child_match` returns a match for getitem_4, and the final getitem_3 pattern matching returns the correct `[getitem_3, getitem_4]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140193
Approved by: https://github.com/eellison
2024-11-27 20:19:13 +00:00
b37cfddeb3 Refactor ShapeGuardPrinter for future C++ addition (#140968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140968
Approved by: https://github.com/anijain2305
ghstack dependencies: #140597
2024-11-27 20:09:58 +00:00
e5d02e0cfb Fix non-determinism in the partitioner (#141682)
When multiple nodes have similar sizes and are part of the `banned_nodes` (which is a `set` and not a `list`), there is non-determinism present in the partitioner due to sorting only by node-size.

This PR fixes this by also sorting by node name.
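
A minimal sketch of the tie-breaking idea, using a hypothetical `Node` stand-in rather than the real FX node type:

```python
# Illustrative only: sorting by size alone leaves equal-size nodes in set
# iteration order, which is non-deterministic; adding the name as a secondary
# key makes the order deterministic.
from collections import namedtuple

Node = namedtuple("Node", ["name", "size"])  # hypothetical stand-in
banned_nodes = {Node("b", 8), Node("a", 8), Node("c", 4)}

ordered = sorted(banned_nodes, key=lambda n: (n.size, n.name), reverse=True)
print([n.name for n in ordered])  # ['b', 'a', 'c'] on every run
```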

It would be good to add some tests, but I'm not sure about the best way to do it here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141682
Approved by: https://github.com/Chillee, https://github.com/yf225
2024-11-27 19:33:15 +00:00
8c90a9a030 Revert "fix non termination in unflatten + state (#141494)"
This reverts commit 5d7c3701e40374113921771097ebc65d9c2876bf.

Reverted https://github.com/pytorch/pytorch/pull/141494 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/141494#issuecomment-2504639230))
2024-11-27 19:30:55 +00:00
63e3cfc00b Enable both training and inference perf benchmark on GPU by default when using workflow dispatch (#141708)
Feedback from @bobrenjc93: while https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml doesn't run inference perf by default, the dashboard shows that mode on its landing page https://hud.pytorch.org/benchmark/compilers. This is a source of confusion because folks won't see their branches unless they choose the correct mode.

IMO, it makes sense to run both training and inference by default when using workflow dispatch. This ensures that the branch will show up in both modes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141708
Approved by: https://github.com/bobrenjc93
2024-11-27 19:00:25 +00:00
53f8a5fde2 [FR] Include mismatch rank into mismatch_collectives and update log message (#141631)
Summary: We want to return the mismatch ranks info in the `mismatch_collectives` field. Also update the logging message when no error is found and it's not partial analysis.

Test Plan: CI

Differential Revision: D66522602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141631
Approved by: https://github.com/c-p-i-o
2024-11-27 18:57:21 +00:00
17fd53d8e5 [Inductor] Inplacing with Donated Buffer (#140113)
Currently, inductor does not in-place update a buffer if it is an input buffer, because we don't know whether an input will be used by other functions.

A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-27 18:51:52 +00:00
75fbcc5743 [ARM] Expand linux aarch64 unit test list (#140799)
Expand the list of unit tests for test_linux_aarch64

These have been verified externally as passing on neoverse n1 and v1 based machines.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140799
Approved by: https://github.com/snadampal, https://github.com/malfet
2024-11-27 18:43:55 +00:00
ad39a2fc46 [1/N] Decouple Flight Recorder from NCCL utils (#141648)
Part of the effort to make Flight Recorder device agnostic.

Step 1: Move it out of NCCLUtils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141648
Approved by: https://github.com/fduwjj
2024-11-27 18:29:42 +00:00
fd553b9817 Add remaining method and tests for dtype propagation (#140057)
Adds the remaining unimplemented ops as well as an assertion failure if someone adds a new op without a dtype rule.

We test all unique pointwise operators registered as lowerings which have an opinfo. There will be some follow ups for this to work well with both `codegen_upcast_to_fp32` as True and False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140057
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister, https://github.com/ezyang
ghstack dependencies: #139945
2024-11-27 17:06:44 +00:00
566ceb3e7e Refactor dtype propagation (#139945)
A couple changes.

- Tries to reuse dtype propagation rules that were already registered in inductor. These were present both in `pointwise_overrides_data` and in the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules. Saves those registrations and reuses them later.

- Factors out `get_promoted_dtype`, which uses functools.lru_cache and therefore takes in non-CSEVariable args, because CSEVariable objects will not work with the functools cache.

Tests get added later in the stack when everything is implemented.
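
A toy sketch of the caching pattern referenced above; `promoted_dtype` is a hypothetical helper, not the inductor implementation:

```python
# Illustrative only: lru_cache requires hashable arguments, so the cached
# helper works on plain torch.dtype values rather than CSEVariable objects.
import functools
import torch

@functools.lru_cache(maxsize=None)
def promoted_dtype(*dtypes: torch.dtype) -> torch.dtype:
    return functools.reduce(torch.promote_types, dtypes)

print(promoted_dtype(torch.int32, torch.float16))  # torch.float16
print(promoted_dtype.cache_info().hits)            # 0 after a single distinct call
```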

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
2024-11-27 16:57:02 +00:00
8012ff96ba [MPS] Add MetalShaderLibrary::getFunctionNames() (#141499)
This returns the names of all the functions in a shader.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141499
Approved by: https://github.com/manuelcandales, https://github.com/Skylion007
ghstack dependencies: #141474, #141475, #141476, #141477
2024-11-27 16:53:38 +00:00
381213ee8a test_torchinductor: Improve cpp_wrapper skip message (#141176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141176
Approved by: https://github.com/desertfire
2024-11-27 16:35:54 +00:00
893ca5f671 Remove check_metrics_vec_kernel_count from test_cpu_repro.py::CPUReproTests::test_transpose_non_contiguous (#141246)
The test was initially added due to accuracy issues, which are sufficiently covered by the `self.common(fn, (x,))` assertion.

Unfortunately, the test fails due to tiling logic on a `128-bit` vector size, which is outside the scope of this test, so the check was overly specific.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141246
Approved by: https://github.com/desertfire
2024-11-27 16:04:21 +00:00
b75bb64eb4 [AOTI XPU] Rename test_cuda_cpp_wrapper.py to test_gpu_cpp_wrapper.py, (#135320)
[Inductor] Rename test_cuda_cpp_wrapper.py to test_gpu_cpp_wrapper.py, since the test suite is shared by cuda and xpu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135320
Approved by: https://github.com/jansel, https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #135318
2024-11-27 14:08:06 +00:00
7ea0da2d57 Modest code motion in compile_fx (#141574)
Do code review with whitespace changes off. Check comments for what I changed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141574
Approved by: https://github.com/bobrenjc93, https://github.com/jansel
ghstack dependencies: #141491, #141492
2024-11-27 13:38:14 +00:00
4ae1c4cbb5 Implement nonzero for large inputs (#141592)
Fixes #51871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141592
Approved by: https://github.com/ezyang
2024-11-27 10:19:53 +00:00
aa827e319e [Inductor][CPP] Extract common functions to be reused in other CPP Template (#141554)
**Summary**
Extract common internal functions from the GEMM template into public functions, so they can be reused by the subsequent group GEMM template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141554
Approved by: https://github.com/jgong5
2024-11-27 09:52:18 +00:00
763038db66 Clarify torch.arange floating-point rounding behavior (#141655)
Added documentation note clarifying the rounding behavior of `torch.arange` when using floating-point dtypes, particularly for reduced precision types like `bfloat16`. This helps users understand potential issues like repeated values and provides guidance on using integer dtypes for precise sequences.
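
A small illustration of the documented behavior; the exact ranges are arbitrary and the outputs depend on the dtype's precision:

```python
import torch

# With a reduced-precision float dtype, accumulated rounding can produce
# repeated (or skipped) values in the sequence.
print(torch.arange(0, 50, 0.1, dtype=torch.bfloat16)[-10:])

# For a precise sequence, generate in an integer dtype and rescale afterwards.
print((torch.arange(0, 500, dtype=torch.int64) * 0.1)[-10:])
```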

## Changes
- Added explanatory note about floating-point rounding behavior and its effects
- Included specific mention of `bfloat16` dtype issues
- Added recommendation to use integer dtypes for precise sequences

Fixes [#137774](https://github.com/pytorch/pytorch/issues/137774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141655
Approved by: https://github.com/cpuhrsch
2024-11-27 09:31:39 +00:00
43a2a231d3 Support linear/BN fusion and follow the API guideline (#141585)
The current `fuse` function supports conv/BN fusion only. This commit adds support for linear/BN fusion as well. Changes to follow the API guidelines are also applied.
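
A worked sketch of the standard linear + BatchNorm1d fusion algebra, shown here only to illustrate what such a fusion computes; it is not the code added by this commit:

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 8)
bn = nn.BatchNorm1d(8).eval()   # fusion folds the *running* statistics

# BN(Wx + b) == W'x + b' with W' = W * gamma/sqrt(var + eps)
# and b' = (b - mean) * gamma/sqrt(var + eps) + beta
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
fused = nn.Linear(4, 8)
with torch.no_grad():
    fused.weight.copy_(linear.weight * scale[:, None])
    fused.bias.copy_((linear.bias - bn.running_mean) * scale + bn.bias)

x = torch.randn(2, 4)
with torch.no_grad():
    print(torch.allclose(bn(linear(x)), fused(x), atol=1e-6))  # True
```

Because the running statistics are folded in, this equivalence only holds for BatchNorm in eval mode.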

(This will close the PR #141352 which I created for the same topic and got approval but had lint and API guideline problems.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141585
Approved by: https://github.com/ezyang
2024-11-27 06:52:00 +00:00
9e299b883b [c10d] Test needs abort; otherwise will hang (#141509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141509
Approved by: https://github.com/wz337, https://github.com/fduwjj
2024-11-27 05:47:17 +00:00
5accae4197 [sparse] add extra options to _cslt_sparse_mm (#137427)
Summary:

Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.

This PR adds in the additional cuSPARSELt bindings into pytorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-27 05:32:45 +00:00
e3161ba6ec [BE] Fix incompatible-std-redefinition warning (#141630)
Fixes following warning during CUDA bazel builds
```
nvcc-real warning : incompatible redefinition for option 'std', the last value of this option was used
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141630
Approved by: https://github.com/cyyever, https://github.com/kit1980
2024-11-27 05:06:36 +00:00
3d5fe0ce78 torch._scaled_mm: support dims of size 0 for tensorwise scaling (#140967)
Summary:

Ensures we support dims of size 0 properly in `torch._scaled_mm`. Follows the behavior from `torch.mm`.

For now only enable support for tensorwise, we can tackle rowwise in a future PR.

Test Plan:

```
python test/test_matmul_cuda.py -k test_zero_dim
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140967
Approved by: https://github.com/eqy, https://github.com/drisspg
2024-11-27 04:07:52 +00:00
6e61ff4fd3 Revert "Add truediv support in export serializer (#136364)"
This reverts commit 1df440dc4e7ece40db597ce8e477e14b9c44fea7.

Reverted https://github.com/pytorch/pytorch/pull/136364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its doc build failure is legit ([comment](https://github.com/pytorch/pytorch/pull/136364#issuecomment-2502620732))
2024-11-27 03:24:31 +00:00
19d01a1ef0 Apply clang-format for ATen/core/boxing headers (#141105)
The code change was made by adding a path config to the `.lintrunner.toml` file and running

```bash
 $ lintrunner -a --take CLANGFORMAT --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141105
Approved by: https://github.com/cyyever, https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-27 02:49:24 +00:00
c9e2b3fefe NJT: Return correct number of outputs for chunk() on the batch dim (#141604)
The old logic was completely wrong, returning `chunk_size` chunks instead of the intended number. The original test didn't catch this because `chunk_size == num_chunks` :p New OpInfo-based testing covers it, though.
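
A hedged usage sketch, assuming a build where `chunk()` on the batch dim of a jagged-layout NJT is supported as described above; the shapes are arbitrary:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3), torch.randn(3, 3), torch.randn(5, 3)],
    layout=torch.jagged,
)
chunks = nt.chunk(2, dim=0)   # request 2 chunks along the batch dim
print(len(chunks))            # expected: 2, the requested number of chunks
```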
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141604
Approved by: https://github.com/soulitzer
ghstack dependencies: #141500, #140736, #140161, #141392, #141506
2024-11-27 02:31:23 +00:00
43121b6f0d Adjust output NJT ragged_idx for reductions and select() (#141506)
This fixes some bugs when performing reductions / select() on dims before the ragged dim. In this case, the output NJT has a smaller number of dims, and its ragged_idx should reflect that correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141506
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #141500, #140736, #140161, #141392
2024-11-27 02:25:53 +00:00
807a7dbf9f Don't generate modindex (#141601)
Fixes https://github.com/pytorch/pytorch/issues/141591
The generated index looks ugly. Attempting to not generate it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141601
Approved by: https://github.com/malfet, https://github.com/albanD
2024-11-27 02:07:21 +00:00
0c587c324d DOC: Correct torch.trapezoid docstring (#141459)
This is super duper minor, but I believe this corrects a typo in the documentation of `torch.trapezoid`.

The documentation says the input is a 1-dimensional tensor $y_0, \dots, y_n$, but it uses summations going from 1 to n-1. Since it's summing over terms $y_i - y_{i-1}$, stopping at n-1 excludes the last partition $y_n - y_{n-1}$, which doesn't match the implementation...

```python
import torch

# (just showing it does include $y_n - y_{n-1}$)
assert torch.trapezoid(torch.tensor([0.0, 0.0, 9999.0])) == 9999 / 2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141459
Approved by: https://github.com/colesbury
2024-11-27 01:54:14 +00:00
fca0f34b83 Switch c10::string_view to std::string_view (#139635)
Shortens `string_view_starts_with` to `starts_with`. Adds some missing headers. Isolates `c10_string_view` to use with `get_fully_qualified_name`.

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D64833558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139635
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-27 01:41:18 +00:00
d6276c2fbd Remove double space from warning (#141566)
Removes a double space from a warning in a way consistent with prior lines.

(Sorry, I saw this a few times when running vllm and the double space was killing me)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141566
Approved by: https://github.com/colesbury
2024-11-27 01:32:00 +00:00
3e90c00a87 Missing space in torch.autograd.Function deprecation warning (#141562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141562
Approved by: https://github.com/colesbury
2024-11-27 01:31:26 +00:00
136ff97095 [dynamo][log] Remove print torch inner stacktrace to let users focus on their code error (#141553)
Fixes #140394

**Test Result**

```bash
TORCH_LOGS="graph_breaks" python test.py
```

```python
# test.py
from typing import List
import torch

def fn002(x):
    x = x + 1
    torch._dynamo.graph_break()
    x = x + 1
    return x

def fn001(x):
    return fn002(x)

torch.compile(fn001, backend="eager")(torch.randn(1))

```
**Before log**
```
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Graph break in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] User code traceback:
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 11, in fn001
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return fn002(x)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 6, in fn002
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     torch._dynamo.graph_break()
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Traceback (most recent call last):
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 641, in wrapper
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return inner_fn(self, inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2314, in CALL
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self._call(inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2308, in _call
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.call_function(fn, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 879, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/variables/functions.py", line 328, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return super().call_function(tx, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/variables/functions.py", line 129, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 885, in inline_user_function_return
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 3045, in inline_call
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return cls.inline_call_(parent, func, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 3171, in inline_call_
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     tracer.run()
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 1032, in run
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     while self.step():
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]           ^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 944, in step
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.dispatch_table[inst.opcode](self, inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 641, in wrapper
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return inner_fn(self, inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2314, in CALL
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self._call(inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2308, in _call
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.call_function(fn, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 879, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/variables/functions.py", line 708, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     unimplemented(msg)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     raise Unsupported(msg, case_name=case_name)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] torch._dynamo.exc.Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
V1126 16:01:41.722000 1303718 torch/_dynamo/symbolic_convert.py:424] [1/0] [__graph_breaks] Graph break (details suppressed) in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:41.722000 1303718 torch/_dynamo/symbolic_convert.py:424] [1/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py
```

**After log**
```
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Graph break in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] User code traceback:
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 11, in fn001
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return fn002(x)
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 6, in fn002
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     torch._dynamo.graph_break()
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]
V1126 16:01:19.918000 1303438 torch/_dynamo/symbolic_convert.py:423] [1/0] [__graph_breaks] Graph break (details suppressed) in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:19.918000 1303438 torch/_dynamo/symbolic_convert.py:423] [1/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
```

**Using tlparse get stacktrace**

The trace log implement for graph breaks in
5318bf8baf/torch/_dynamo/symbolic_convert.py (L417-L424)

**Get trace log by running**

```bash
TORCH_TRACE=/tmp/my_traced_log python test.py
```

**Using tlparse to get report**

```
tlparse dedicated_log_torch_trace_9unwqrxn.log  -o out1
```

**Result**

![image](https://github.com/user-attachments/assets/01d2ff25-90ec-4b9f-bcb6-5ae59ba65b35)

strack info in `0_0_0/dynamo_graph_break_reason_0.txt `
![image](https://github.com/user-attachments/assets/c4a04bd0-496a-4862-8230-c01f85e6f3c3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141553
Approved by: https://github.com/shink, https://github.com/ezyang
2024-11-27 01:26:11 +00:00
8c8a484d72 Add some symbolic shapes guard logs to tlparse by default (#140867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140867
Approved by: https://github.com/bdhirsh
2024-11-27 01:00:14 +00:00
0221e3a960 Fix CTC cuda backend out-of-bound access (#141607)
Fixes #140777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141607
Approved by: https://github.com/eqy
2024-11-27 00:53:02 +00:00
cyy
2f082e1e56 [13/N] Fix extra warnings brought by clang-tidy-17 (#140897)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140897
Approved by: https://github.com/ezyang
2024-11-27 00:35:19 +00:00
1df440dc4e Add truediv support in export serializer (#136364)
Fixes #136113

- [x] Inital `truediv` coverage
- [ ] Expand/reduce coverage?
- [x] Add tests
- [x] Re-check docstrings
- [ ] Linting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364
Approved by: https://github.com/pianpwk

Co-authored-by: Angela Yi <angelayi@meta.com>
Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
2024-11-27 00:31:47 +00:00
9b89fa44ba [MPS] Modify missing op message (#141314)
To point to https://github.com/pytorch/pytorch/issues/141287 as well as to reference the commit hash (to clearly distinguish between ops that have been implemented in trunk vs ones that are still missing)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141314
Approved by: https://github.com/manuelcandales, https://github.com/albanD
ghstack dependencies: #141313
2024-11-27 00:24:33 +00:00
07850bb2c1 [dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137397
Approved by: https://github.com/jansel
ghstack dependencies: #141360
2024-11-27 00:21:58 +00:00
cdde73033e [dynamo] fix generic namedtuple support when the class is created via class MyTuple(NamedTuple, Generic[T]): ... (#141360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141360
Approved by: https://github.com/jansel
2024-11-27 00:21:58 +00:00
54f4621ca5 Add missing explicit include directive for <cerrno> in c10/util/error… (#141593)
`c10/util/error.cpp` uses the symbol `errno` but is missing an explicit header include directive for `<cerrno>`.

cc) @malfet , @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141593
Approved by: https://github.com/Skylion007
2024-11-27 00:00:23 +00:00
cyy
199d3da632 [9/N] Don't skip ASAN on some tests (#141534)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141534
Approved by: https://github.com/ezyang
2024-11-26 23:52:53 +00:00
605392bd06 add float8 types to LoggingTensor (#141385)
Summary:

float8 dtypes were missing from this map, adding

Test Plan:

CI, and unbreaks debugging in torchao

If there is an existing test I can add this to - lmk

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141385
Approved by: https://github.com/soulitzer
2024-11-26 23:39:57 +00:00
5b0b16ca62 [torch/distributed] Make _SymmetricMemory.has_multicast_support() ret… (#141598)
`SymmetricMemory.has_multicast_support()` throws an exception rather than returning `False` when called with a `DeviceType` that it does not support. For example:

```
from torch._C._distributed_c10d import _SymmetricMemory
from torch._C._autograd import DeviceType

try:
    supports_multicast = _SymmetricMemory.has_multicast_support(DeviceType.CPU, 0)
except RuntimeError as exc:
    assert str(exc) == "SymmetricMemory does not support device type cpu"
```

This is problematic when building PyTorch from source without `CUDASymmetricMemory.cu` since the [`@requires_multicast_support`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_distributed.py#L353) test decorator will throw an exception rather than skipping the test (as intended)

This PR makes `_SymmetricMemory.has_multicast_support()` properly return `False` when multicast is not supported on the passed device.

cc) @malfet , @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141598
Approved by: https://github.com/yifuwang
2024-11-26 23:36:32 +00:00
43afaa4aac Allow users to overwrite ld with environment variable in linker optimization script (#137331)
This should help in the case of cross compilation.

xref: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/261

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137331
Approved by: https://github.com/isuruf, https://github.com/seemethere
2024-11-26 22:54:24 +00:00
23793cf93d NJT unsqueeze() fixes (#141392)
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125

Note that `unsqueeze()` still doesn't support batch dim operation, and arguably should never support this.
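
A sketch illustrating point 3 above, assuming a jagged-layout NJT; shapes are arbitrary:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(5, 8)], layout=torch.jagged
)                      # logical shape (B, j1, D)
out = nt.unsqueeze(1)  # (B, 1, j1, D); the output's ragged_idx is adjusted
print(out.dim())       # 4
```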
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736, #140161
2024-11-26 22:38:35 +00:00
9ee5d6f83c Initial NJT testing over dim type / views (#140161)
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.

Test logic (sample input generation, references) has been updated to utilize this data. It allows for a fairly generic set of sample inputs & a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").

Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`

Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.

I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736
2024-11-26 22:08:08 +00:00
7671dd436e [SDPA-CPU] Fix Edge case w/ fused flash cpu kernel (#141519)
Fixes https://github.com/pytorch/pytorch/issues/141128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141519
Approved by: https://github.com/jgong5, https://github.com/jbschlosser
2024-11-26 22:07:56 +00:00
f3d16ec76f Add doc preview command (#141590)
Convenience, when we build pytorch docs
1. Docs for the build weren't clear that `make html` is the main command intended to be run
2. Once you run `make html` you need to visualize the result; opening up a simple HTTP server seems like the simplest solution, so this adds a `make serve` command

Usage

```shell
numpy ❯ make serve PORT=8080 # Add port optionally
Serving HTTP on :: port 8080 (http://[::]:8080/) ...
::1 - - [26/Nov/2024 10:05:41] "GET / HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/copybutton.css HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/katex-math.css HTTP/1.1" 200 -
```

![Screenshot 2024-11-26 at 10 05 46 AM](https://github.com/user-attachments/assets/3b275c33-1515-4e21-b540-f5a68c8a8e55)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141590
Approved by: https://github.com/svekars, https://github.com/malfet
2024-11-26 21:56:54 +00:00
65dbd5cc2d Revert "[Inductor] Inplacing with Donated Buffer (#140113)"
This reverts commit eecc8e362c2eb192cbe13322af941d09ca647a6b.

Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))
2024-11-26 21:20:59 +00:00
869d629c0f Forward / backward NJT support for several activation functions (#140736)
Several activation functions were unimplemented due to missing `pointwise` tags. This PR adds them and corresponding backwards implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140736
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #141500
2024-11-26 21:19:58 +00:00
9f4f061f89 PyProcessGroup: support rank, world size, group name/desc overrides (#141529)
This improves `PyProcessGroup` so you can override rank, world size and group name/desc methods from Python. These will be needed to support resizable process groups in torchft.

This also has some small fixes in test_c10d_pypg.py to use threads instead of processes which speeds up the test execution by ~10x.

Test plan:

```
pytest test/distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141529
Approved by: https://github.com/fegin
2024-11-26 20:56:57 +00:00
5696df439b tools: Add script to do split build in one command (#141359)
Usage:
```bash
python3 tools/packaging/split_wheel.py bdist_wheel
python3 tools/packaging/split_wheel.py install
python3 tools/packaging/split_wheel.py develop
```
Ideally this should make it easier to do the split build locally while
we're doing development.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141359
Approved by: https://github.com/kit1980
2024-11-26 20:51:05 +00:00
8ba555ec8a Fix where() for NJT (#141500)
**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).
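
A hedged sketch of the user-level behavior this fix targets, assuming a jagged-layout NJT; the values are illustrative:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 4), torch.randn(5, 4)], layout=torch.jagged
)
zero = torch.scalar_tensor(0.0)         # strided scalar tensor, not jagged
out = torch.where(nt > 0, nt, zero)     # clamp negative entries to zero
print(out.shape[0])                     # 2 (batch size preserved)
```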

**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
    * Uses limited `broadcast_tensors()` / `broadcast_to()` support
    * Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to use `scalar_tensor()` instead of e.g. `zeros({})`

**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
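
A minimal usage sketch of the scalar `other` case handled above (component shapes are made up, not taken from the PR's tests):

```python
import torch

# Jagged-layout nested tensor (NJT) with two components of shape (2, 3) and (4, 3).
nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], layout=torch.jagged
)
# The Python scalar `other` is wrapped via scalar_tensor(); with this change
# that wrapper is created with torch.strided rather than torch.jagged layout.
out = torch.where(nt > 0, nt, 0.0)
```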
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-26 20:13:27 +00:00
011650adc5 [sigmoid] Refactor out a helper function to insert const graph into top level graph. (#140854)
Summary: Add a helper function to put a const graph back into the top-level graph; this can be useful when we're taking const graphs from delegates.

Test Plan: CI

Reviewed By: trieuat

Differential Revision: D63031982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140854
Approved by: https://github.com/SherlockNoMad
2024-11-26 20:07:46 +00:00
6fa4356451 handle sympy.oo in bitwise_and/or value_ranges (#141522)
An internal test is failing due to not handling `sympy.oo` properly in bitwise_and/or value_ranges: [T208684142](https://www.internalfb.com/intern/tasks/?t=208684142). I don't know how to repro this - seems like this requires inductor to trigger as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141522
Approved by: https://github.com/ezyang
ghstack dependencies: #138777
2024-11-26 20:01:31 +00:00
84f818f359 [DTensorTestbase] Fix TestFunc typing issue (#141513)
Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of any type (`object`) and returns a value of any type (`object`). However, in reality, `TestFunc` can take any number of arguments; as a result, the correct typing is `Callable[..., object]`, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`).
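
For illustration only (standard `typing` spellings, not code from the diff):

```python
from typing import Callable

SingleArgFunc = Callable[[object], object]  # exactly one positional argument
AnyArgsFunc = Callable[..., object]         # any number of arguments of any type

def call_test_func(fn: AnyArgsFunc, *args: object) -> object:
    return fn(*args)
```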

Test Plan: Contbuild & OSS CI

Differential Revision: D66463705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-11-26 19:48:34 +00:00
893a4390c9 Use cuda 12.6 wheels with Manylinux 2.28. Use Manylinux2014 for CPU, CUDA11.8, CUDA12.4 (#141565)
For release 2.6 we will be using only CUDA 12.6 binaries on Manylinux 2.28.
Issue: https://github.com/pytorch/pytorch/issues/123649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141565
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/malfet
2024-11-26 19:36:42 +00:00
eqy
816ca98cd2 [cuDNN][SDPA] Update cuDNN grad output layout check (#141147)
Thanks to https://github.com/pytorch/pytorch/pull/137978 from @Skylion007, which bumps cuDNN to 9.5.1, the broken assumption that dO strides == O strides is fixed.

Note that there is still the restriction that the innermost stride of the grad output is 1 (this is almost always guaranteed because this condition is required of the input tensors). The main exception would be in test code that does e.g., `.sum().backward()` which yields grad output tensors with strides `[0, 0, 0, 0]`.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141147
Approved by: https://github.com/drisspg
2024-11-26 19:17:01 +00:00
a99332eb25 [ROCM] Support Multi-GPU offline tuning in TunableOp (#139673)
This PR enhances offline tuning to support multiple GPUs.

High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename

Also adding support for GemmAndBias and ScaledGemm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
2024-11-26 19:07:41 +00:00
5b4c864672 [c10d] Enable CudaEventCache by default and add multi device support (#140975)
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727; it is a feature that reuses CUDA events so that we don't call CudaEvent destroy, which has caused hangs in the past. We have a bunch of tests and have already been testing on TorchTitan and internal workloads. So far no errors or crashes have been found, so we decided to roll it out to all OSS users. For internal workloads, this PR has no effect because of internal gating.

Also, we observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-11-26 18:42:45 +00:00
44186a0a4e Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-26 18:11:00 +00:00
29ca44839e Add skip_first_wait to profiler.schedule (V2) (#141512)
Summary:
Another try for D66198138. Original diff had some weird issue with type checking. Setting everything to int this time to get around it.

Addresses https://github.com/pytorch/pytorch/issues/91888
We use `wait` as the number of steps to wait in between cycles when profiling, and `skip_first` to delay the start of said profiling. However, once the `skip_first` steps are completed, we immediately go to the wait phase. This is not problematic if `wait` is smaller than `skip_first`, because we can just lower the value of `skip_first`, but if it is larger, then we end up starting the first profile much later than desired. For example, imagine a `skip_first` of 1 and a `wait` of 100 with `repeat` of 2. We do want to wait 100 steps in between cycles 1 and 2, but we may not want to start warmup of cycle 1 at step 101 (forced, because the wait occurs directly after the first skipped steps). This diff addresses this by adding a flag to skip the first wait.
Adds the new flag but sets it to false by default so that the existing implementation is not affected.

Test Plan:
Got the following traces with this schedule:

```python
schedule = torch.profiler.schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
```
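
A minimal end-to-end usage sketch with that schedule (the workload and step count are placeholders, not from the test plan):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

sched = schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for _ in range(30):
        torch.randn(128, 128) @ torch.randn(128, 128)  # placeholder work
        prof.step()
```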

Differential Revision: D66465860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141512
Approved by: https://github.com/aaronenyeshi
2024-11-26 18:10:54 +00:00
809de05693 Update libgfortran version in aarch64 Docker (#141583)
From `libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb` to `libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb`, as the former is no longer available:
```
% curl --head http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb
HTTP/1.1 404 Not Found
Date: Tue, 26 Nov 2024 16:58:10 GMT
Server: Apache/2.4.29 (Ubuntu)
Content-Type: text/html; charset=iso-8859-1
```
vs
```
% curl --head http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb
HTTP/1.1 200 OK
Date: Tue, 26 Nov 2024 16:58:48 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Sun, 31 Mar 2024 10:51:08 GMT
ETag: "713d4-614f2a681d48b"
Accept-Ranges: bytes
Content-Length: 463828
Content-Type: application/x-debian-package
```

Here is the failure: https://github.com/pytorch/pytorch/actions/runs/12032016986/job/33542862322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141583
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/malfet
2024-11-26 17:49:34 +00:00
000d4e9d43 [hop][inductor] remove codegen_subgraph_suffix and directly assign call function result to outer outputs (#141181)
Before the PR: P1683356646
After the PR: P1683356585

Relevant changes:
```
@@ -231,7 +421,8 @@
             true_graph_0_args = [true_graph_0_arg0_1, true_graph_0_arg1_1]
             del true_graph_0_arg0_1
             del true_graph_0_arg1_1
+            (buf5[0],) = true_graph_0(true_graph_0_args)
-             (true_graph_0_buf0,) = true_graph_0(true_graph_0_args)
-             buf5[0] = true_graph_0_buf0
         else:
             # subgraph: false_graph_0
             false_graph_0_arg0_1 = buf4
@@ -239,7 +430,8 @@
             false_graph_0_args = [false_graph_0_arg0_1, false_graph_0_arg1_1]
             del false_graph_0_arg0_1
             del false_graph_0_arg1_1
+            (buf5[0],) = false_graph_0(false_graph_0_args)
-             (false_graph_0_buf0,) = false_graph_0(false_graph_0_args)
-             buf5[0] = false_graph_0_buf0
         del arg2_1
         del buf4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141181
Approved by: https://github.com/anijain2305
ghstack dependencies: #140334, #141172
2024-11-26 17:32:51 +00:00
aae581d921 [hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172
Approved by: https://github.com/Chillee
ghstack dependencies: #140334
2024-11-26 17:32:50 +00:00
45bc9165fe [hop] add discard_graph_changes to remove the empty calls before hop (#140334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140334
Approved by: https://github.com/zou3519
2024-11-26 17:32:43 +00:00
eecc8e362c [Inductor] Inplacing with Donated Buffer (#140113)
Currently, inductor does not in-place update a buffer if it is an input buffer, because we don't know whether an input will be used by other functions.

A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-26 17:19:50 +00:00
3ef031909f [Donated Buffer] support metadata mutation ops (#141308)
### Background:

`set_(x, y)` changes the untyped storage of `x` to be the same as `y`'s.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

x1 = torch.ones(2,3)
y1 = torch.ones(2,3)
z1 = torch.ops.aten.set_.source_Tensor(x1, y1)

fake_tensor_mode = FakeTensorMode()
x2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
y2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
z2 = torch.ops.aten.set_.source_Tensor(x2, y2)

print(f"x1: {x1.untyped_storage()._cdata}, y1: {y1.untyped_storage()._cdata}, z1: {z1.untyped_storage()._cdata}")
print(f"x2: {x2.untyped_storage()._cdata}, y2: {y2.untyped_storage()._cdata}, z2: {z2.untyped_storage()._cdata}")
# x1: 99973024, y1: 99973024, z1: 99973024
# x2: 112107232, y2: 112107232, z2: 112107232
```

### Error before this diff

Consider this example:
```python
import torch

def fn(x):
    p = torch.nn.Parameter(x + 123)
    return p, p.sin()

opt = torch.compile(fn, fullgraph=True)
x = torch.ones(16, device="cuda", requires_grad=True)

p, r = opt(x)
r.sum().backward()
```

When running with `TORCH_LOGS=aot`, we have `set_` in the graph.
```
def forward(self, primals_1: "f32[16][1]cuda:0", primals_2: "f32[16][1]cuda:0"):
   # File: /home/boyuan/playground/inductor/donated_buffer.py:4 in fn, code: p = torch.nn.Parameter(x + 123)
  add: "f32[16][1]cuda:0" = torch.ops.aten.add.Tensor(primals_1, 123);  primals_1 = None

   # File: /home/boyuan/playground/inductor/donated_buffer.py:5 in fn, code: return p, p.sin()
  sin: "f32[16][1]cuda:0" = torch.ops.aten.sin.default(add)

  # No stacktrace found for following nodes
  set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add);  primals_2 = set_ = None
  return (sin, add)
```

`set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add)` should change the storage of `primals_2` to be the same as `add`. However, this is not true before this diff. We found different untyped_storage() for meta['val'] of `set_`, `add`, and `primals_2`.

This also leads to an error with donated buffer (#130580), which checks alias by untyped_storage. Since `add` and `primals_2` have different untyped_storage (which is wrong), add is wrongly marked as donated buffer.

### Root Cause

During tracing, we have args, kwargs, out, and proxy_args, proxy_kwargs, proxy_out.

We use args and kwargs to compute `out = func(*args, **kwargs)` ([Here](https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L912)). Later, we associate `out` with its proxy, essentially calling `proxy_out.node.meta["val"] = out.detach()`.

Due to the detach, the storage change happens on `args` but not on `proxy_args.node.meta["val"]` when `func` is `torch.ops.aten.set_`. I reproduced this behavior of detach in eager code.

```python
import torch

x = torch.ones(2,3)
x_detach = x.detach()
y = torch.ones(2,3)
z = torch.ops.aten.set_.source_Tensor(x_detach, y)

print(f"x: {x.untyped_storage()._cdata}, x_detach: {x_detach.untyped_storage()._cdata}, y: {y.untyped_storage()._cdata}, z: {z.untyped_storage()._cdata}")
# x: 97023632, x_detach: 97026480, y: 97026480, z: 97026480
```

To fix the issue, this PR manually resets node.meta["val"] if the storage has changed.
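
A rough sketch of that idea (the function name and call site are hypothetical; the real change lives in the proxy_tensor tracing code), using the same `untyped_storage()` identity check as the repro above:

```python
def _refresh_meta_val_if_storage_changed(proxy_arg, arg):
    # If the op (e.g. aten.set_) rebound `arg` to a new storage, the cached
    # detached snapshot in meta["val"] is stale and needs to be replaced.
    old = proxy_arg.node.meta.get("val")
    if old is not None and old.untyped_storage()._cdata != arg.untyped_storage()._cdata:
        proxy_arg.node.meta["val"] = arg.detach()
```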

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141308
Approved by: https://github.com/bdhirsh
2024-11-26 17:06:46 +00:00
99a0e2b1a1 [dynamo] Trace through dataclasses by removing it from BUILTIN_SKIPLIST (#141294)
Fixes #141261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141294
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-11-26 17:05:23 +00:00
2bbd984aa2 Fix typo in Reproducibility docs (#141341)
Fixes trivial issue in the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141341
Approved by: https://github.com/svekars
2024-11-26 16:53:26 +00:00
42ab61241e Add README for torch._inductor.runtime (#141492)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141492
Approved by: https://github.com/jansel
ghstack dependencies: #141491
2024-11-26 14:43:02 +00:00
94ff3985c9 AFAICT, compile workers never actually mocked torch (#141491)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141491
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-11-26 14:43:02 +00:00
9d4c0527b3 [Inductor][CPP] Modularize the CPP GEMM Template (#141006)
**Summary**
Move the common template code, which may be reused in subsequent group GEMM templates, into the standalone sub-templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141006
Approved by: https://github.com/jgong5
2024-11-26 14:32:40 +00:00
313c1b33c5 Update CUDA installation script to 12.6.3 (#141365)
related to https://github.com/pytorch/pytorch/issues/138440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141365
Approved by: https://github.com/atalman
2024-11-26 13:49:51 +00:00
9dd3b85d05 [Inductor XPU] Fix wrong device check before skip concat linear. (#140916)
Fix #140917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140916
Approved by: https://github.com/EikanWang, https://github.com/eellison
2024-11-26 13:30:26 +00:00
4742080ed9 [AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-11-26 11:51:32 +00:00
c418a9ac75 [Intel GPU] XPUInductorQuantizer for XPU int8 recipe customization (#139578)
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the recipe for int8 quantization on the XPU backend.

# Detailed
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).

We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxxx` methods. In the ideal situation, `XPUInductorQuantizer` would have no class body at all, as the entire implementation can be inherited from the base class.

In this PR, we override the `annotate_xxx` methods for operators that have NOT yet been implemented. All operators the XPU backend does not implement fall back to the fp32 implementation, as the node in the graph is a `dq-op-q` pair. This helps provide good out-of-the-box usability for the XPU backend. On the other hand, the implemented operators use the `annotate_op` implemented in the base class and can be lowered successfully.
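
A hypothetical end-to-end sketch of the pt2e flow with this quantizer, by analogy with the `X86InductorQuantizer` flow; the import path, config helper name, and capture API are assumptions rather than details from this PR:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (  # path assumed
    XPUInductorQuantizer,
    get_default_xpu_inductor_quantization_config,  # helper name assumed
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().to("xpu")
example_inputs = (torch.randn(1, 3, 32, 32, device="xpu"),)

exported = torch.export.export_for_training(model, example_inputs).module()
quantizer = XPUInductorQuantizer()
quantizer.set_global(get_default_xpu_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)            # calibration
converted = convert_pt2e(prepared)
compiled = torch.compile(converted)  # lowered via Inductor
```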

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
2024-11-26 09:44:14 +00:00
5318bf8baf Revert "[sparse] add extra options to _cslt_spare_mm (#137427)"
This reverts commit f1451163ecd2bd014cb80a40c41c9999fbc94af8.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to This looks like the test is still failing, plz do a rebase ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2499918590))
2024-11-26 08:01:24 +00:00
cyy
6d4cd3e5f2 Remove linking of private cuda targets (#141463)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141463
Approved by: https://github.com/malfet
2024-11-26 03:51:53 +00:00
648f5d9dd9 [Intel GPU] qconv at XPU backend (#133080)
# Motivation
This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.

# Details

The implemented operators will be further integrated into the pt2e quant flow. In this PR, we validated the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend device support for the inductor lowering pass and the inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends can share the general optimization pass (op fusion) and the quantization inductor IR. After lowering, the final kernel is dispatched to the appropriate implementation in the oneDNN library.

In this PR, we share the same int8 quantizer as CPU, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer`, which will customize the pt2e behaviors for the XPU backend. The capability of `XPUInductorQuantizer` will gradually grow along with the development of quantized operators on XPU.

# Validation
*  UT testing
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_xpu \
   -k test_qconv2d_silu_xpu \
   -k test_qconv2d_relu6_xpu \
   -k test_qconv2d_hardtanh_xpu \
   -k test_qconv2d_hardswish_xpu
```
* Runtime exemplification
```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945

#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
2024-11-26 02:24:30 +00:00
f2d388eddd [BE] Use torch.special.expm1 (#141518)
Instead of `torch.exp(x)-1`, as suggested by TorchFix
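
A quick illustration of why this matters numerically (values chosen to show the cancellation; not from the PR):

```python
import torch

x = torch.tensor(1e-10)            # float32 by default
print(torch.exp(x) - 1)            # tensor(0.) -- exp(x) rounds to exactly 1.0
print(torch.special.expm1(x))      # tensor(1.0000e-10) -- small result preserved
```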

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141518
Approved by: https://github.com/kit1980
2024-11-26 01:47:11 +00:00
dcd16bdc21 [Dynamo][autograd.Function] Use fake tensor prop to infer fwd output (#136184)
Fixes #129963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136184
Approved by: https://github.com/zou3519
2024-11-26 01:10:08 +00:00
cyy
6b60f4bc91 Fix some typos in cuda.cmake (#141462)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141462
Approved by: https://github.com/peterbell10
2024-11-26 01:08:25 +00:00
6a22cae436 [IntraNodeComm] fix a recent breakage (#141200)
- Pass `group_name` to `CUDASymmetricMemory::alloc()` instead of `CUDASymmetricMemory::rendezvous()`. We can only move the argument to rendezvous() once all the underlying operators do the same.
- Added `float` to the allowlist for intra-node all-reduces.
- Added a warning when `IntraNodeComm::rendezvous()` is performed with overlapping devices among participants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141200
Approved by: https://github.com/weifengpy, https://github.com/kwen2501
2024-11-26 00:46:38 +00:00
583484b726 [dynamo] Fix and simplify hanlding of Set.update method (#141286)
The old implementation of `SetVariable.call_method("update", ...)` was
incorrect because it wouldn't handle iterable inputs. This patch
removes the input type restriction altogether and implements the method
as a polyfill (like how most of the other set methods are handled).

Fixes #141283.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141286
Approved by: https://github.com/anijain2305
2024-11-26 00:41:50 +00:00
5d7c3701e4 fix non termination in unflatten + state (#141494)
With largish systems of nn modules with buffers, sinking params suffered from some kind of exponential blowup that is easily fixed by using a set instead of a list to keep track of unlifted buffer placeholders.

Test Plan: added random dag test that failed previously

Differential Revision: D66457661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141494
Approved by: https://github.com/angelayi
2024-11-26 00:17:56 +00:00
9ccbd84316 Upgrade ROCm wheels to manylinux2_28 - 1 of 2 (docker images) (#140681)
Fixes #140631

Highlights:
* Use `cpu_final` base for ROCm in `.ci/docker/manywheel/Dockerfile_2_28`
* Cleans up install_miopen.sh to remove old ROCm references
* Install `gcc-gfortran` package to build magma for ROCm on almalinux

Needs builder PR https://github.com/pytorch/builder/pull/2043 (merged) so that GCC_ABI expected value is updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140681
Approved by: https://github.com/jeffdaily
2024-11-26 00:10:40 +00:00
8f5ce865a4 [Build] Add COMMIT_SHA to caffe2::GetBuildOptions (#141313)
Using the same `tools/generate_torch_version.py` script

It's already available at the Python level, but not at the C++ level.

Please note that updating the commit hash will force recompilation of fewer than 10 files according to
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```

Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
2024-11-26 00:09:36 +00:00
ad37afd590 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit ba5253da9b30ed4d998cee1d865f92b2c27d3086.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/yf225 due to perf regression on torchbench ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2499277511))
2024-11-26 00:03:03 +00:00
964655bf0c Revert "Remove THC from OSS build (#134969)"
This reverts commit 9c7660be0ee155baf0cb7e1e67708dd784ac5796.

Reverted https://github.com/pytorch/pytorch/pull/134969 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking the installation of https://github.com/facebookresearch/detectron2/blob/main/detectron2/layers/csrc/deformable/deform_conv_cuda_kernel.cu#L76 ([comment](https://github.com/pytorch/pytorch/pull/134969#issuecomment-2499275378))
2024-11-26 00:00:12 +00:00
f1451163ec [sparse] add extra options to _cslt_spare_mm (#137427)
Summary:

Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.

This PR adds in the additional cuSPARSELt bindings into pytorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-25 23:45:41 +00:00
02990fe36b Populate nn.module.stack in _fuse_conv_bn_qat (#141400)
Summary:
Populate nn.module.stack in _fuse_conv_bn_qat for replacement nodes that correspond to a `get_attr` node in the original graph.

In the new training IR, `get_attr` nodes don't have `nn_module_stack` in node meta anymore (because the get_attr nodes are de-duplicated, so one get_attr node can potentially have users in different module stacks).

We populate it by checking if "conv_input" or "conv_weight" replacement node has nn_module_stack. If not, we copy it from the conv node.

Test Plan:
CI

```
buck run fbcode//caffe2/test:quantization_pt2e -- -r test_preserve_nn_module_stack
```

Differential Revision: D66393517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141400
Approved by: https://github.com/angelayi
2024-11-25 23:41:28 +00:00
851edf208b [ROCm] Remove gfx906 from CI docker build (#141523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141523
Approved by: https://github.com/jeffdaily
2024-11-25 22:23:28 +00:00
915625307e [PGNCCL] Record device index for GPU guarding during NCCLComm method calls (#141270)
### Motivation
`ncclCommInitRank` needs GPU guard (documented in NCCL).

`ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need GPU guard (undocumented in NCCL); otherwise, extra CUDA context may be created (or worse, hang); both effects have been seen before in our tests.

### Solution
This PR records a device index during `NCCLComm` object creation, so that we can add a GPU guard in the `NCCLComm` methods that call into the above NCCL APIs.

### Note
This is not a bug fix. Just a safety improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141270
Approved by: https://github.com/eqy
ghstack dependencies: #141374
2024-11-25 21:31:21 +00:00
af4522b81c [c10d][CI] Use new store for PG restart tests (#141374)
A new Store is used to recreate PGs upon restart. The new Store is obtained by adding a prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141374
Approved by: https://github.com/fduwjj
2024-11-25 21:31:21 +00:00
b18bbc965c [dynamo] support list.sort sort non-constant iterable with constant keys (#141485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141485
Approved by: https://github.com/jansel
2024-11-25 21:06:11 +00:00
efec302dd0 cpp_wrapper tests: Fix tests assuming non-cpp_wrapper code (#141175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141175
Approved by: https://github.com/desertfire
2024-11-25 19:33:55 +00:00
78491d6afc Update triton wheel install script with new versioning (#141497)
This PR is a follow-on to #141410.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141497
Approved by: https://github.com/huydhn
2024-11-25 19:09:55 +00:00
91f7c547ec [FlexAttention] add support for learnable biases in Inductor (#137452)
# Summary

The follow-up PR to https://github.com/pytorch/pytorch/pull/137526. In this PR, we actually update the lowerings for the flex_attention backwards kernel to generate fused backward gradient calculations for any captured buffers that require grads.

We do this using tl.atomic_add to scatter the correct gradients into a zeroed-out buffer for any captured buffers that require grads. Many test cases were added, and along the way some masking bugs were found and fixed.

There are likely some performance cliffs here, specifically with different dtypes and on different GPUs. We plan to profile the current strategy and address this in a follow-up. We are explicitly choosing reduced memory over increased performance right now.

By using atomics, we do not need to realize a full attention scores matrix. However, this comes with two downsides. One, this is potentially slower in some cases, and two, the gradient calculation for any captured buffers is non-deterministic.

## Worked Example

Let's take the case where you read from one buffer that doesn't require grad and use it to index into another that does.

ScoreMod:
```Python
bias = torch.randn(
    params.seq_length,
    device=self.device,
    dtype=params.dtype,
    requires_grad=True,
)

offset = torch.randint(
    0,
    params.seq_length,
    (params.seq_length,),
    device=self.device,
)

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[offset[q_idx]]

```
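
A minimal sketch of wiring this `score_mod` into `flex_attention` so that gradients flow into the captured `bias` (the head count, head dim, and the compile wrapper are assumptions, not from the PR):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 1, 1, params.seq_length, 64  # illustrative shapes
q, k, v = (
    torch.randn(B, H, S, D, device=self.device, dtype=params.dtype)
    for _ in range(3)
)

out = torch.compile(flex_attention)(q, k, v, score_mod=score_mod)
out.sum().backward()
print(bias.grad.shape)  # the fused backward scatters grads into `bias`
```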

I am removing all but the new subgraph injected into the backwards:

``` Python
    dsT = pT * (dpT - Di[None, :])
    # ~~~~~~~~~~~~~~~~~~~ Apply joint modification  ~~~~~~~~~~~~~~~~~~~
    grad_scores = (dsT)

    # ~~~~~~~~~~~~~~~~~~~ Apply other buffer grad writes ~~~~~~~~~~~~~
    idx_b = off_z
    idx_h = off_hq
    idx_m = m
    idx_n = n
    scatter_mask = offs_m1[None, :] < Q_LEN and offs_n1[:, None] < KV_LEN
    tmp4 = (dsT).to(tl.float32)
    tl.atomic_add(out_ptr1 + (tl.broadcast_to(tl.load(in_ptr16 + idx_m), tmp4.shape)), tmp4, scatter_mask, sem='relaxed')

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
## Key points
* We always accumulate to float32 grad buffers regardless of the type used in the forward. This is because we normally do all computation intra-kernel with fp32 accumulation, and we want the same behavior for atomic additions.
* We are currently restricted to one scatter in the kernel. I have some ideas on fx rewrites that would remove this restriction, but for now there is a nice error message with a workaround, and I will leave this as a follow-up.
* Will do more extensive performance/memory profiling in a follow-up.

### Toy E2E example
I have a toy E2E training example PR in the gym for now: https://github.com/pytorch-labs/attention-gym/pull/84/
I plan to update to a realistic learnable bias before landing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137452
Approved by: https://github.com/Chillee
2024-11-25 19:08:34 +00:00
de6d69ec78 [MPS] Make MetalShaderLibrary usable from C++ (#141477)
By guarding the Metal framework include and defining all ObjC protocols as dummy `void*`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141477
Approved by: https://github.com/Skylion007
ghstack dependencies: #141474, #141475, #141476
2024-11-25 18:40:55 +00:00
953e5f9201 [MPS][BE] Add virtual destructor (#141476)
As classes with virtual methods must have virtual destructors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141476
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #141474, #141475
2024-11-25 18:40:55 +00:00
b532a84be5 [MPS] Move MetalShaderLibrary to its own header (#141475)
In preparation to be used from libtorch_python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141475
Approved by: https://github.com/Skylion007
ghstack dependencies: #141474
2024-11-25 18:40:47 +00:00
1bca1220de [MPS][BE] Remove unused definitions (#141474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141474
Approved by: https://github.com/Skylion007
2024-11-25 18:40:40 +00:00
9a09011cd1 [inductor] Refactor dependencies.extract_loop_body_with_args (#141404)
I plan to reuse this in a later PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141404
Approved by: https://github.com/yanboliang
2024-11-25 18:34:09 +00:00
8f5edcb75c [CUTLASS] Lift shape & stride information as kernel args (#138611)
Differential Revision: [D64773324](https://our.internmc.facebook.com/intern/diff/D64773324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138611
Approved by: https://github.com/chenyang78
2024-11-25 17:52:33 +00:00
2325749a89 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 7a9d0e3c06781dda04a9cc3dcf56ff09cf472235.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2498670406))
2024-11-25 17:51:53 +00:00
4a378d77d4 [AMD] Add ncclRemoteError back (#141461)
Summary:
It looks like RCCL does have support for those two error types, ncclRemoteError and ncclInProgress: https://github.com/ROCm/rccl/blob/develop/src/nccl.h.in#L57. And I do see my job throwing those errors, but pytorch just said:
```
RuntimeError: Unconvertible NCCL type
```

Even though nccl says:
```
develop/src/init.cc.hip:502 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
```

Therefore just enabling those.

Test Plan: CI

Differential Revision: D66434341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141461
Approved by: https://github.com/eqy
2024-11-25 17:49:34 +00:00
5ececd4caa [ROCm] Select gpu targets according to PYTORCH_ROCM_ARCH when building AOTriton from source (#139432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139432
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Vicky Tsang <vtsang@amd.com>
2024-11-25 17:33:57 +00:00
419b566e54 [ONNX] Use the torchlib opset number and fix opset import logic (#141413)
- Update the ONNX IR `add_opset_imports` pass to remove the heuristics of taking the `max` of the seen opsets. Instead, it uses the torchlib default opset version for the model's opset_import. The version converter is able to take the true opset versions in the nodes and convert the model to the correct version.
- Update all hard coding of opset 18 to instead query the default torchlib opset from onnxscript, introduced in https://github.com/microsoft/onnxscript/pull/1963

Fixes https://github.com/pytorch/pytorch/issues/141260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141413
Approved by: https://github.com/titaiwangms
2024-11-25 17:33:25 +00:00
4fa72168ea FlopCounterMode: Decompose ops for inference mode (#138508)
Fixes #126268

I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.

Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-25 16:53:10 +00:00
cffeb83f15 Revert "Forward / backward NJT support for several activation functions (#140736)"
This reverts commit daaecb96d6b8049f8ca95974cd8a45b2fb9d4e28.

Reverted https://github.com/pytorch/pytorch/pull/140736 on behalf of https://github.com/malfet due to Take 2, of stack revert your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2498479702))
2024-11-25 16:27:00 +00:00
e0f9ec4a25 Revert "Initial NJT testing over dim type / views (#140161)"
This reverts commit 730caf0aed187ce5c1c36fae7e9ae1f700585280.

Reverted https://github.com/pytorch/pytorch/pull/140161 on behalf of https://github.com/malfet due to Sorry for reverting your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2498358652))
2024-11-25 15:40:54 +00:00
58727b6f5f Revert "NJT unsqueeze() fixes (#141392)"
This reverts commit 48409a5cc6b14b6a5237beb6263a436d309afcd2.

Reverted https://github.com/pytorch/pytorch/pull/141392 on behalf of https://github.com/malfet due to Sorry for reverting your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2498358652))
2024-11-25 15:40:54 +00:00
07906f2f2b [logging] Move population of common MetricsContext fields to record_compilation_metrics (#141291)
Summary: Fix outstanding TODOs related to logging of CompilationMetrics by moving the population of common fields to record_compilation_metrics() instead of populating those independently wherever we use the metrics_context contextmanager:
* Keep track of start and end time in MetricsContext and pass those to record_compilation_metrics() and populate those fields in that function.
* Pass exception info to record_compilation_metrics() and populate those fields in that function.
* Add a new contextmanager, chromium_event_timed, to create the start/end "dynamo" event. This is important because I want this contextmanager to complete _after_ building the CompilationMetrics.
* Populate the compile_id field centrally in record_compilation_metrics().
* Populate the structured_logging_overhead centrally in record_compilation_metrics().
* Add the CompilationMetrics to the current chromium event in record_compilation_metrics(), after all common fields have been added. In a future diff, I can also add _all_ compilation metrics to the chromium event.

Test plan: Unit tests. Also see internal testing:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/jrascnf9
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/l3jnla06
* tlparse: https://fburl.com/bq5a9nqs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141291
Approved by: https://github.com/jamesjwu
2024-11-25 13:18:40 +00:00
a964f31d7b [inductor] modify the heuristic for loop split optimization (#137550)
### Summary

1. Improve the heuristic for loop split optimization: The divisor needs to be an integer and cannot be too small (needs to be greater than 8, this threshold has been tuned).
2. Improve the heuristic for disabling vectorization: add quantity_threshold and relax ratio_threshold for the number of non-contiguous load/store/index_expr in the loop body.

This PR will bring performance improvements for two torchbench models(functorch_dp_cifar10, opacus_cifar10) and one timm model(sebotnet33ts_256).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-25 09:16:30 +00:00
48409a5cc6 NJT unsqueeze() fixes (#141392)
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125

Note that `unsqueeze()` still doesn't support batch dim operation, and arguably should never support this.
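
An illustrative sketch of the newly supported `(B, j1, D) -> (B, 1, j1, D)` case (component shapes made up):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(5, 8)], layout=torch.jagged
)  # logical shape (B, j1, D)
out = nt.unsqueeze(1)  # (B, 1, j1, D): unsqueezing at the ragged position now works
```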
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #140736, #140161
2024-11-25 08:08:38 +00:00
730caf0aed Initial NJT testing over dim type / views (#140161)
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.

Test logic (sample input generation, references) have been updated to utilize this data. It allows for a fairly generic set of sample inputs & a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").

Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`

Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.

I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #140736
2024-11-25 08:08:38 +00:00
daaecb96d6 Forward / backward NJT support for several activation functions (#140736)
Several activation functions were unimplemented due to missing `pointwise` tags. This PR adds them and corresponding backwards implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140736
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-11-25 08:08:31 +00:00
d0fd42eb3a [inductor] refine loop split logic (#128812)
This PR aims to improve parallelization by collapsing the vectorized loop. https://github.com/pytorch/pytorch/issues/122281

For such a case, the parallel level is only `2`, and the vectorized loop cannot be collapsed.
```
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199984L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
        tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
    }
    #pragma omp simd simdlen(8)
    for(long x1=static_cast<long>(199984L); x1<static_cast<long>(199985L); x1+=static_cast<long>(1L))
    {
        auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
        out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
    }
}
```
After this PR, we will generate code like:
```
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199985L); x1+=static_cast<long>(16L))
    {
        if (x1 >= 0 && x1 <199984) {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
        }
        if (x1 >= 199984 && x1 <199985) {
            auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
            out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
        }
    }
}
```

### Highlight
For the reduction case, we have a side effect here.
For the case below, we vectorize the `x1` dim and reduce along the `x2` dim.
```
#pragma omp for
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
            }
            [&]
            {
                __at_align__ std::array<float, 8> tmpbuf;
                tmp_acc0_vec.store(tmpbuf.data(), 8);
                #pragma GCC unroll 8
                for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                }
            }
            ()
            ;
        }
    }
    #pragma omp simd simdlen(4)
    for(int64_t x1=static_cast<int64_t>(16L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(1L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = in_ptr1[static_cast<int64_t>(x1 + (17L*x2) + (306L*x0))];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            out_ptr1[static_cast<int64_t>(x0 + (39L*x1))] = tmp_acc0;
        }
    }
}

```
After collapsing, the loop order will be `x1 -> x2 -> x1_tail_part`, so we need a `tmp_acc_arr` to store the reduction result for `x1_tail_part`. And for `reduction_stores`, we also need to check `x1`'s value as we do in the loop body, since the `reduction_stores` happen between the `x1` and `x2` loops.
```
#pragma omp for collapse(2)
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0_arr[8];           ######### need an array to hold acc result for tail part
            for (int i = 0; i < 8; i++)
            {
                tmp_acc0_arr[i] = -std::numeric_limits<float>::infinity();
            }
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                        tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
                    {
                        for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                        {
                            auto tmp0 = in_ptr1[static_cast<int64_t>(x1_tail + (17L*x2) + (306L*x0))];
                            tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)] = max_propagate_nan(tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)], tmp0);
                        }
                    }
                }
            }

            ############### reduction stores
            if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
            {
                [&]
                {
                    __at_align__ std::array<float, 8> tmpbuf;
                    tmp_acc0_vec.store(tmpbuf.data(), 8);
                    #pragma GCC unroll 8
                    for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                    {
                        out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                    }
                }
                ()
                ;
            }
            if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
            {
                for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1_tail))] = tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)];
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128812
Approved by: https://github.com/jgong5
2024-11-25 04:46:07 +00:00
2398e758d2 Fix access to _msvccompiler from newer distutils (#141363)
Newer versions of distutils no longer import `_msvccompiler` upon init (on the Windows platform; that was not the case on other platforms even before version 74), but it's still accessible if one chooses to import it directly.
Test plan:
```
% python -c 'from setuptools import distutils; print(distutils.__version__, hasattr(distutils, "_msvccompiler")); from distutils import _msvccompiler; import setuptools; print(setuptools.__version__, _msvccompiler.__file__)'
3.10.9 False
65.5.0 /usr/local/fbcode/platform010/Python3.10.framework/Versions/3.10/lib/python3.10/site-packages/setuptools/_distutils/_msvccompiler.py
```
and
```
% python -c 'from setuptools import distutils; print(distutils.__version__, hasattr(distutils, "_msvccompiler")); from distutils import _msvccompiler; import setuptools; print(setuptools.__version__, _msvccompiler.__file__)'
3.13.0 False
75.6.0 /Users/malfet/py312-venv/lib/python3.13/site-packages/setuptools/_distutils/_msvccompiler.py
```

Gave up trying to appease the linter, so rewrote it as the following function:
```python
def _get_vc_env(vc_arch: str) -> dict[str, str]:
    try:
        from setuptools import distutils  # type: ignore[import]

        return distutils._msvccompiler._get_vc_env(vc_arch)  # type: ignore[no-any-return]
    except AttributeError:
        from setuptools._distutils import _msvccompiler  #type: ignore[import]

        return _msvccompiler._get_vc_env(vc_arch)  # type: ignore[no-any-return]
```

This PR also undoes the setuptools version restriction introduced by https://github.com/pytorch/pytorch/pull/136489, as the premise for that restriction is incorrect.

Fixes https://github.com/pytorch/pytorch/issues/141319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141363
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-11-25 01:50:47 +00:00
6ad0423758 [CI]Move inductor UT from avx512 runner to amx runner (#141206)
According to https://github.com/pytorch/pytorch/issues/140208#issuecomment-2477813174, we need to run inductor UT on Sapphire Rapids runner to cover AMX Micro GEMM tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141206
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-11-25 01:26:58 +00:00
cyy
9c7660be0e Remove THC from OSS build (#134969)
THC is not used in OSS version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134969
Approved by: https://github.com/albanD
2024-11-25 00:39:42 +00:00
6a096a0b96 [PT2] Fix callbacks to account for entire execution in compilation (#141323)
Summary:
In SJD, we register callbacks to get notified of an active compilation. Using this information, we can basically allow additional time for the training loop.

The callbacks currently do not account for the entire time, and in several cases the end callback is not called at all.

This leads to a bunch of APS jobs getting terminated incorrectly: https://fburl.com/scuba/mast_hpc_job_run_status/ondwzt2w

In this diff, we basically install a context manager which will call the start and end callbacks, similar to how we log counters and other information.

Test Plan:
```
buck2 run mode/opt //aps_models/examples/dlrm:dlrm_train_app -- --config-name train_mast_fsdp_torchdynamo launcher.data_project=apf_ai_infra launcher.fbl_entitlement=ai_infra_training_rnd_tc  launcher.hardware=TC_ANY_80G
```
Led to https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-atuljangra-ef2285ba9a?job_attempt=0&version=0&env=prod

https://fburl.com/ai_infra/sv0a213y confirms that callback was correctly called and a lease was properly installed, which takes over the training loop lease.

{F1965137027}

Differential Revision: D66347023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141323
Approved by: https://github.com/ezyang
2024-11-24 22:31:04 +00:00
cb8c956b5f Fix PyBind 2.10.4 compatibility issue in caffe2/torch/csrc/dynamo/guards.cpp +2 (#141456)
Summary: See D65023502 and [here](https://fb.workplace.com/groups/mldp.users/permalink/8706556336131960/) for details.

Test Plan: Sandcastle

Reviewed By: itamaro

Differential Revision: D66395491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141456
Approved by: https://github.com/Skylion007
2024-11-24 21:05:48 +00:00
675735cfc9 [dynamo] match implementation for sorted(...) with CPython (#141227)
```python
def sorted(iterable, /, *, key=None, reverse=False):
    seq = list(iterable)
    seq.sort(key=key, reverse=reverse)
    return seq
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141227
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #141224
2024-11-24 20:01:50 +00:00
cyy
259a00b727 [3/N] Replace at::detail::Array with std::array (#141324)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141324
Approved by: https://github.com/ezyang
2024-11-24 18:17:34 +00:00
e3cb167560 Revert "Add skip_first_wait to profiler.schedule (#141070)"
This reverts commit 9d83cab8a4f3a21a012303361bbee39318d241e0.

Reverted https://github.com/pytorch/pytorch/pull/141070 on behalf of https://github.com/izaitsevfb due to oops, it's actually reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141070#issuecomment-2496141168))
2024-11-24 18:03:50 +00:00
9d83cab8a4 Add skip_first_wait to profiler.schedule (#141070)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/91888

We use `wait` as the number of steps to wait in between cycles when profiling, and `skip_first` to delay the start of said profiling. However, once the `skip_first` steps are completed, we immediately go to the wait phase. This is not problematic if `wait` is smaller than `skip_first`, because we can just lower the value of `skip_first`, but if it is larger then we end up starting the first profile much later than desired. For example, imagine `skip_first=1` and `wait=100` with `repeat=2`. We do want to wait 100 steps in between cycles 1 and 2, but we may not want to start the warmup of cycle 1 at step 101 (which is forced, because the wait occurs directly after the first steps are skipped). This diff addresses this by adding a flag to skip the first wait.

Adds a new flag but sets it to false by default so that the existing implementation is not affected.

Test Plan:
Got reasonable traces with this schedule:

schedule=torch.profiler.schedule(
            wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
        )

D66198138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141070
Approved by: https://github.com/aaronenyeshi, https://github.com/briancoutinho
2024-11-24 17:54:49 +00:00
e34ff2cb4b remove allow-untyped-defs from _inductor/bounds.py (#141440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141440
Approved by: https://github.com/Skylion007
2024-11-24 16:23:31 +00:00
3614d130dd [XPU] Update XPU C Shim Header (#141086)
Fixes https://github.com/pytorch/pytorch/issues/141268

Caused by these commits: 34b2165bdb and 34e420519d

The windows XPU builds are failing: https://github.com/pytorch/pytorch/actions/runs/11922274722/job/33228175750
due to recent PR merge with changes in fallback ops: 34e420519d

This PR updates the XPU C Shim header file to overcome these build failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141086
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel, https://github.com/malfet, https://github.com/dvrogozh, https://github.com/desertfire
2024-11-24 12:24:35 +00:00
a87925cc7e Fix AttributeError: 'int' object has no attribute 'node' due to constant prop (#141250)
Fixes https://github.com/pytorch/pytorch/issues/140625

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141250
Approved by: https://github.com/bobrenjc93
2024-11-24 08:20:04 +00:00
51b6126f54 Bump onnxscript version in CI (#141412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141412
Approved by: https://github.com/titaiwangms
2024-11-24 06:51:48 +00:00
af47e05a96 [fx] make split_module work with keep_original_order=True and no-op graph (#141340)
Fixes https://github.com/pytorch/pytorch/issues/140014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141340
Approved by: https://github.com/ezyang
2024-11-24 06:41:30 +00:00
cyy
4c1f50af5f Modernize C++ code in aten/src/ATen/ (#141424)
Clang-tidy modernize checkers were applied, and most changes were concatenation of namespaces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141424
Approved by: https://github.com/eqy
2024-11-24 02:15:19 +00:00
ba5253da9b Always unspecialize float in OSS (#138922)
Fixes https://github.com/pytorch/pytorch/issues/107277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-24 01:58:13 +00:00
11c786dcb5 [BE] Make maybe_aliasing_or_mutating proper tag (#131990)
For better tracking, we need to tag maybe-aliasing/mutating ops with a proper tag. We need to special-case native_batch_norm because it is not a CIA but has a wrong schema. I guess native_batch_norm will be removed at some point, so until then we just keep it around.

D60347117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131990
Approved by: https://github.com/bdhirsh
2024-11-24 00:12:49 +00:00
c513f01516 Revert "Add skip_first_wait to profiler.schedule (#141070)"
This reverts commit 8b13ed594a2b9b0a994e8efd42b8f1e59372e499.

Reverted https://github.com/pytorch/pytorch/pull/141070 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141070#issuecomment-2495671689))
2024-11-23 22:22:24 +00:00
995e3079c9 [inductor] Fix for "Failed to find static RBLOCK" (#141434)
Summary: I expect this to fix https://fb.workplace.com/groups/1075192433118967/permalink/1547962839175255/

Test Plan: Ask poster to confirm fix

Differential Revision: D66413828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141434
Approved by: https://github.com/ezyang
2024-11-23 22:08:56 +00:00
f6eeab7ea8 [export] Make unflattened module compileable (#141249)
Test Plan: Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1091988579177957/

Differential Revision: D66302806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141249
Approved by: https://github.com/avikchaudhuri
2024-11-23 18:46:01 +00:00
83116ec90c [dynamo] Fix fbcode flakey test from asyncio warning (#141399)
Summary: This was failing with a `/usr/local/fbcode/platform010/lib/python3.10/asyncio/events.py:666: DeprecationWarning` that seems unrelated.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_numpy_readonly_inline_inbuilt_nn_modules' --run-disabled
```

Differential Revision: D66394773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141399
Approved by: https://github.com/yanboliang
2024-11-23 18:16:50 +00:00
3473dfa698 Add triton_op test for user defined triton caching (#141407)
Fix failing internal codecache test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141407
Approved by: https://github.com/aorenste
2024-11-23 07:54:39 +00:00
8b4ae29b1b misc. fixes to unflatten (#141066)
Handling of nested modules in unflatten had several bugs, which were caught by trying to preserve module call signatures for nested modules.
* A module `k` encountered when calling `k.n()` before `k()` used to become an empty nn module. This caused some information to be dropped when `k()` was eventually called. Relatedly, we would also lose call counts for `k.n()` through different paths (say, when `k()` calls `n()`).
* Deleting call-indexed modules and patching up their call sites was broken for nested modules when creating dispatcher modules, because of silliness when handling their fqns.

An interesting aside is that we used random graph generation for testing some of these changes. A future PR will add the infra to create tests using these random graphs.

Differential Revision: D66192799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141066
Approved by: https://github.com/angelayi
2024-11-23 07:31:51 +00:00
5268754ebd [inductor] Default impl refactors to IRNode (#141321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141321
Approved by: https://github.com/yanboliang
2024-11-23 06:25:59 +00:00
bae9510307 Fix pytorch-triton nightly checksum shorthash (#141410)
Binary build is failing in trunk after https://github.com/pytorch/pytorch/pull/139206 lands, for example, https://github.com/pytorch/pytorch/actions/runs/11981181986/job/33410250461#step:17:539. It's a bit tricky to spot the issue, but the difference is between `3.2.0+35c6c7c628` set by PyTorch and `3.2.0+git35c6c7c6` from Triton (look closely: one suffix is 10 characters long, the other 8).

Triton now has its own nightly build logic in https://github.com/triton-lang/triton/pull/4812 that takes only 8 characters by default while the original logic from PT took 10. So, PT nightly couldn't find the dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141410
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-11-23 04:56:40 +00:00
1f734bc90c Add bfloat16 support to torch.bmm(NST, NST) (#141380)
Adds bfloat16 support to torch.bmm(NST, NST) where NST is NestedTensor with the torch.strided (default) layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141380
Approved by: https://github.com/jbschlosser
2024-11-23 04:18:48 +00:00
66f2550328 Revert "Fix pytorch-triton nightly checksum shorthand (#141410)"
This reverts commit 9f8a19172d3ec417f8a6dce57d62d2aacc36c07c.

Reverted https://github.com/pytorch/pytorch/pull/141410 on behalf of https://github.com/huydhn due to There is still a small tweak that I need to do 35c6c7c628 is now git35c6c7c6 so a prefix is needed, going to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/141410#issuecomment-2495291851))
2024-11-23 04:16:39 +00:00
6cc22976a0 [ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4 (#140851)
Fixes issue of long docker build times in PRs which trigger the docker build in regular PyTorch build jobs eg. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm6.2 because:
1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which [docker-build workflows](924c1fe3f3/.github/workflows/docker-builds.yml (L50)) run (`c5.12xlarge`)
2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into [timeout of 90mins](9abd4d95bb/.github/actions/calculate-docker-image/action.yml (L171)): https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140851
Approved by: https://github.com/jeffdaily
2024-11-23 03:36:27 +00:00
9f8a19172d Fix pytorch-triton nightly checksum shorthand (#141410)
Binary build is failing in trunk after https://github.com/pytorch/pytorch/pull/139206 lands, for example, https://github.com/pytorch/pytorch/actions/runs/11981181986/job/33410250461#step:17:539. It's a bit tricky to spot the issue, but the difference is between `3.2.0+35c6c7c628` set by PyTorch and `3.2.0+git35c6c7c6` from Triton (look closely: one suffix is 10 characters long, the other 8).

Triton now has its own nightly build logic in https://github.com/triton-lang/triton/pull/4812 that takes only 8 characters by default while the original logic from PT took 10. So, PT nightly couldn't find the dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141410
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-11-23 03:25:52 +00:00
a8ab6b0938 Fix failing internal codecache test (#141405)
When the internal remote cache version was bumped to 11, this test started failing; I guess no one noticed it, and it got disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141405
Approved by: https://github.com/aorenste
2024-11-23 02:01:02 +00:00
1aea642393 pytorch/feature: Record if inductor fx cache is enabled (#141059)
This uses the underlying infrastructure and records if the fx cache is
enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141059
Approved by: https://github.com/masnesral
2024-11-23 01:55:27 +00:00
68be990519 Flag TORCH_SDT_SEMAPHORE as being name resolvable (#141191)
Summary:
Mirroring changes in D64604573, it appears this code in libcaffe2 is
mostly a copy of folly's one. Copy of the original diff summary:

This particular inline assembly use cannot be converted to a constraint template parameter (until llvm 18 / gcc 14), as there is no way (until those versions) to specify that a non-pic relocation is needed when compiling under pic.  This inline assembly requires a non-pic relocation because it is being written to the .notes section, which is non .text.

Differential Revision: D66038989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141191
Approved by: https://github.com/dcci
2024-11-23 01:39:44 +00:00
eb954ef3f2 [pipelining] allow multiple backward grads (#140981)
fixes https://github.com/pytorch/pytorch/issues/139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk`
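
A hedged illustration of the caching pattern described above (the names and keying are assumptions, not the actual schedule code): gradients are stashed per chunk and popped exactly once when that chunk's backward runs.

```python
from typing import Dict, Tuple

import torch

# Hypothetical stand-in for the schedule's self.bwd_cache container.
bwd_cache: Dict[int, Tuple[torch.Tensor, ...]] = {}

def stash_input_grads(chunk_id: int, grads: Tuple[torch.Tensor, ...]) -> None:
    # Save the input grads produced for this microbatch chunk.
    bwd_cache[chunk_id] = grads

def backward_one_chunk(chunk_id: int) -> Tuple[torch.Tensor, ...]:
    # Pop after use so the entry cannot be reused or leak memory.
    return bwd_cache.pop(chunk_id)
```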

`python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140981
Approved by: https://github.com/wconstab
2024-11-23 00:35:08 +00:00
2e7ba0b194 Revert "Switch to using Python nested int (#141166)"
This reverts commit e2e8a7fa2e519433a4ec1071f80d2f6f843c6300.

Reverted https://github.com/pytorch/pytorch/pull/141166 on behalf of https://github.com/clee2000 due to broke docs [GH job link](https://github.com/pytorch/pytorch/actions/runs/11980936976/job/33406870951) [HUD commit link](e2e8a7fa2e) ([comment](https://github.com/pytorch/pytorch/pull/141166#issuecomment-2495112297))
2024-11-22 23:54:36 +00:00
ee7eaad5c3 [dynamo] add SymNode bitwise and/or (#138777)
Fixes [T203472723](https://www.internalfb.com/intern/tasks/?t=203472723)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138777
Approved by: https://github.com/ezyang
2024-11-22 23:36:16 +00:00
a8c90e5140 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit 6d779d05492813da1c19ac0c562d0d5f8473f27e.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is some slow tests failing after this land ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2495076878))
2024-11-22 23:18:36 +00:00
c328d200ff [SDPA][CUDA] resync sm90+ priority order for SDPA with test_export.py (#141274)
Since we deprioritized cuDNN SDPA, this test fails on `sm90+`. This PR just changes the expected backend for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141274
Approved by: https://github.com/drisspg
2024-11-22 23:16:41 +00:00
0be0c944b1 Revert "Forward / backward NJT support for several activation functions (#140736)"
This reverts commit af70f5e04c69839a1a0e08942254c170dc4c3d61.

Reverted https://github.com/pytorch/pytorch/pull/140736 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2495075871))
2024-11-22 23:15:55 +00:00
4acc988630 Add ciflow/inductor-cu126 label (#141377)
No op to unblock the testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141377
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-11-22 23:14:24 +00:00
2aac2ec664 [dynamo] fix sorted(...) when key function is explicitly passed with key=None (#141224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141224
Approved by: https://github.com/jansel
2024-11-22 22:42:46 +00:00
57eea3f8e2 Fix a -Wshadow warning in ATen/native/Math.h (#141361)
Move declaration down to point where it's needed, don't redeclare.

Differential Revision: [D66376820](https://our.internmc.facebook.com/intern/diff/D66376820/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141361
Approved by: https://github.com/Skylion007
2024-11-22 22:33:04 +00:00
0ce0e44237 Add workaround for potential runners issue on s390x (#141239)
More information is at
https://gitlab.com/qemu-project/qemu/-/issues/2600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141239
Approved by: https://github.com/huydhn
2024-11-22 22:17:55 +00:00
e2e8a7fa2e Switch to using Python nested int (#141166)
Doesn't seem to noticeably slow down eager - TestNestedTensorSubclass tests with and without the PR finished in similar amounts of time (around 57s, 58s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141166
Approved by: https://github.com/ezyang
2024-11-22 22:12:25 +00:00
af70f5e04c Forward / backward NJT support for several activation functions (#140736)
Several activation functions were unimplemented due to missing `pointwise` tags. This PR adds them and corresponding backwards implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140736
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-11-22 22:05:53 +00:00
5062bbcd86 [inductor] Add missing get_reads() method (#141310)
Summary: This is a possible fix for https://fb.workplace.com/groups/1075192433118967/permalink/794017756161443/

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_test_framework/pt2_staging_tests/sw_v2:smallworld_cmf_test -- --exact 'ai_infra/distributed_ai/pyper_test_framework/pt2_staging_tests/sw_v2:smallworld_cmf_test - test_train (ai_infra.distributed_ai.pyper_test_framework.pt2_staging_tests.sw_v2.smallworld_cmf_test.CmfTest)'
```

Differential Revision: D66340927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141310
Approved by: https://github.com/ezyang
2024-11-22 22:00:18 +00:00
d16aa566ea [FlexAttention] Speed up gradcheck tests (#141356)
# Summary
### Before
```Shell
48.71s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_captured_score_mod_aot_eager_gradcheck_score_mod_name__head_offset_mode_aot_eager
```
### After
Speeds up grad check tests by 10x
```Shell
4.74s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_captured_score_mod_aot_eager_gradcheck_score_mod_name__head_offset_mode_aot_eager
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141356
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #141164, #141185
2024-11-22 21:18:21 +00:00
32583d915e [export] Improve stacktrace filtering (#141285)
Differential Revision: [D66321127](https://our.internmc.facebook.com/intern/diff/D66321127)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141285
Approved by: https://github.com/yushangdi
ghstack dependencies: #141071, #141072
2024-11-22 20:55:04 +00:00
53df1c11cd [export] Add custom op guards (#141072)
For custom ops that do not have a meta kernel, draft export automatically creates a meta kernel based on the tracing example inputs. To ensure that the assumptions made during tracing are clear to the user, we add assertions to the traced exported program:

An example graph:
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, a: "f32[s0, s1]", b: "f32[s2, s3]"):
             # File: /data/users/angelayi/pytorch/test/export/test_draft_export.py:172 in forward, code: res1 = torch.ops.mylib.foo4(a, b)
            _assert_tensor_metadata = torch.ops.aten._assert_tensor_metadata(a, dtype = torch.float32, device = device(type='cpu'));  _assert_tensor_metadata = None
            _assert_tensor_metadata_1 = torch.ops.aten._assert_tensor_metadata(b, dtype = torch.float32, device = device(type='cpu'));  _assert_tensor_metadata_1 = None
            foo4: "f32[u2, u3]" = torch.ops.mylib.foo4.default(a, b);  a = b = None
            return (foo4,)
```

Differential Revision: [D66321129](https://our.internmc.facebook.com/intern/diff/D66321129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141072
Approved by: https://github.com/pianpwk
ghstack dependencies: #141071
2024-11-22 20:55:04 +00:00
0fbc0830ba [export] Add device and dtype fields to assert_tensor_metadata (#141071)
Differential Revision: [D66321128](https://our.internmc.facebook.com/intern/diff/D66321128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141071
Approved by: https://github.com/yushangdi, https://github.com/zou3519
2024-11-22 20:54:55 +00:00
45d62d6fc5 [dynamo] Added cuda and triton versions to dynamo_compile (#141290)
Opening another PR since #141140 was reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141290
Approved by: https://github.com/masnesral
2024-11-22 20:04:42 +00:00
2a6eaa2e6f Refactor nightly pull tool to use venv and pip (#141281)
Resolves #141238

- #141238

Example output:

```console
$ python3.12 tools/nightly.py checkout -b my-nightly-branch -p my-env --python python3.10
log file: /Users/PanXuehai/Projects/pytorch/nightly/log/2024-11-22_04h15m45s_63f8b29e-a845-11ef-bbf9-32c784498a7b/nightly.log
Creating virtual environment
Creating venv (Python 3.10.15): /Users/PanXuehai/Projects/pytorch/my-env
Installing packages
Upgrading package(s) (https://download.pytorch.org/whl/nightly/cpu): pip, setuptools, wheel
Installing packages took 5.576 [s]
Creating virtual environment took 9.505 [s]
Downloading packages
Downloading package(s) (https://download.pytorch.org/whl/nightly/cpu): torch
Downloaded 9 file(s) to /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/pip-download-lty5dvz4:
  - mpmath-1.3.0-py3-none-any.whl
  - torch-2.6.0.dev20241121-cp310-none-macosx_11_0_arm64.whl
  - jinja2-3.1.4-py3-none-any.whl
  - sympy-1.13.1-py3-none-any.whl
  - MarkupSafe-3.0.2-cp310-cp310-macosx_11_0_arm64.whl
  - networkx-3.4.2-py3-none-any.whl
  - fsspec-2024.10.0-py3-none-any.whl
  - filelock-3.16.1-py3-none-any.whl
  - typing_extensions-4.12.2-py3-none-any.whl
Downloading packages took 7.628 [s]
Installing dependencies
Installing packages
Installing package(s) (https://download.pytorch.org/whl/nightly/cpu): numpy, cmake, ninja, packaging, ruff, mypy, pytest, hypothesis, ipython, rich, clang-format, clang-tidy, sphinx, mpmath-1.3.0-py3-none-any.whl, jinja2-3.1.4-py3-none-any.whl, sympy-1.13.1-py3-none-any.whl, MarkupSafe-3.0.2-cp310-cp310-macosx_11_0_arm64.whl, networkx-3.4.2-py3-none-any.whl, fsspec-2024.10.0-py3-none-any.whl, filelock-3.16.1-py3-none-any.whl, typing_extensions-4.12.2-py3-none-any.whl
Installing packages took 42.514 [s]
Installing dependencies took 42.515 [s]
Unpacking wheel file
Unpacking wheel file took 3.223 [s]
Checking out nightly PyTorch
Found released git version ac47a2d9714278889923ddd40e4210d242d8d4ee
Found nightly release version e0482fdf95eb3ce679fa442b50871d113ceb673b
Switched to a new branch 'my-nightly-branch'
Checking out nightly PyTorch took 0.198 [s]
Moving nightly files into repo
Linking /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/wheel-dljxil5i/torch-2.6.0.dev20241121/torch/_C.cpython-310-darwin.so -> /Users/PanXuehai/Projects/pytorch/torch/_C.cpython-310-darwin.so
Linking /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/wheel-dljxil5i/torch-2.6.0.dev20241121/torch/lib/libtorch_python.dylib -> /Users/PanXuehai/Projects/pytorch/torch/lib/libtorch_python.dylib
...
Linking /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/wheel-dljxil5i/torch-2.6.0.dev20241121/torch/include/c10/macros/Macros.h -> /Users/PanXuehai/Projects/pytorch/torch/include/c10/macros/Macros.h
Moving nightly files into repo took 11.426 [s]
Writing pytorch-nightly.pth
Writing pytorch-nightly.pth took 0.036 [s]
-------
PyTorch Development Environment set up!
Please activate to enable this environment:

  $ source /Users/PanXuehai/Projects/pytorch/my-env/bin/activate
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141281
Approved by: https://github.com/seemethere
2024-11-22 20:03:55 +00:00
75cecba164 [inductor] Move fusion heuristics to V.choices (#141108)
This is a refactor to enable out of tree autotuners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141108
Approved by: https://github.com/yanboliang
2024-11-22 19:53:07 +00:00
d8c14838f1 [ca] dead code elimination for compile time (#141289)
Although these nodes are eventually inlined away, they increase compile time, especially when initial CA graph capture treats all shapes as dynamic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141289
Approved by: https://github.com/jansel
ghstack dependencies: #141152, #141153
2024-11-22 19:26:27 +00:00
db4e8a1d8a [ca] expose option to collect sizes as dynamic (#141153)
This is to address recompiles from eager nodes that saved dynamic activations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141153
Approved by: https://github.com/jansel
ghstack dependencies: #141152
2024-11-22 19:26:27 +00:00
1024a1c3d1 [ca] fix dynamic shape logging (#141152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141152
Approved by: https://github.com/jansel
2024-11-22 19:26:27 +00:00
7c5c38da23 Fix constant lifting pass when there is no user input (#141157)
Differential Revision: [D66253854](https://our.internmc.facebook.com/intern/diff/D66253854/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141157
Approved by: https://github.com/zhxchen17
2024-11-22 19:08:25 +00:00
40d0740e73 [PT2][Optimus] Fix a corner case in merge splits (#141194)
Summary:
We found another corner case in merge splits where the first split node does not have consecutive getitem indices; we need to skip such cases.

{F1964255863}

Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split  --flow_id 666002198 2>&1 | tee ~/cmf.txt
```

P1683429791

Differential Revision: D66275387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141194
Approved by: https://github.com/jackiexu1992
2024-11-22 19:04:40 +00:00
e54538afc8 [export] fix sympy.expr roundtrippability for serialization (#141284)
Summary:
Latest attempt after [136802](https://github.com/pytorch/pytorch/pull/136802) and [140084](https://github.com/pytorch/pytorch/pull/140084) got shelved.

This keeps the string format for `expr_str`, but calls `sympy.printing.repr.srepr(s)` instead of `str(s)`, which prints expressions more explicitly, e.g.
```
((2*x)//(3*y + 4)) -> "FloorDiv(Mul(Integer(2), Symbol('x')), Add(Mul(Integer(3), Symbol('y')), Integer(4)))"
```

This is nice because:
- we have better roundtrippability for deserialization, robust to pretty printing changes like [this](6c9bfd52b6/torch/utils/_sympy/functions.py (L208)) that caused the issue in the first place.
- this preserves the BC surface for both 1) sigmoid thrift serialization, by keeping the string format, and 2) deserialization for old IRs, since `sympy.sympify(...)` still handles the old `str(s)` format.
- more memory efficient than storing ASTs; the [AST attempt](https://github.com/pytorch/pytorch/pull/140084) increased artifact size by 20% on some toy programs.
- doesn't even require a schema version bump.
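
A small sketch of the roundtrip property relied on here, using plain sympy (an assumption for illustration, since `FloorDiv` above is a PyTorch-internal sympy function):

```python
import sympy
from sympy.printing.repr import srepr

x, y = sympy.symbols("x y")
expr = 2 * x + 3 * y

s = srepr(expr)  # e.g. "Add(Mul(Integer(2), Symbol('x')), Mul(Integer(3), Symbol('y')))"
assert sympy.sympify(s) == expr  # srepr output parses back to the same expression
```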

Additionally, to push some test cases over the line, this redoes expression processing (handling ranges, symbol caching) with bottom-up processing instead of the current hacky-ish workflow.

Test Plan: test_serdes, test_serialize, internal tests broken by AST PR

Differential Revision: D66283208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141284
Approved by: https://github.com/zhxchen17
2024-11-22 18:47:04 +00:00
e6962f8f19 [c10d] Relax CUDA context test criteria (#141298)
After `destroy_process_group`, it is possible that the CUDA context finishes its job and exits, so NVML detects 0 processes on the device. This PR relaxes the current check condition (that there must be exactly 1 active process on that device) to cover this possibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141298
Approved by: https://github.com/eqy
2024-11-22 18:38:25 +00:00
57fc070e08 [triton] Update pin for PyTorch 2.6/Triton 3.2 (#139206)
Bump the Triton pin to the release candidate commit for Triton 3.2.

A few changes beyond the pin bump itself are needed:
* Remove the script that adds a git version hash suffix to the Triton wheel, since as of https://github.com/triton-lang/triton/pull/4812 Triton adds that itself
* Add `pybind11` to the Triton build setup, since Triton now depends on it
* Use manylinux-2.28 for the Triton wheel builder, and use clang+lld for building to pick up the right glibc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139206
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-11-22 18:34:32 +00:00
313dac6c1c [export] Fix name inconsistency between thrift and schema.py (#141151)
Summary: The struct type is named "InputToConsantInputSpec" in thrift, which causes some inconsistency with the schema. Changing the type name from one to another is okay-ish because it doesn't change the on-wire format.

Test Plan: CI

Differential Revision: D66240951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141151
Approved by: https://github.com/yiming0416
2024-11-22 18:04:23 +00:00
44d5012a80 Revert "[triton] Update pin for PyTorch 2.6/Triton 3.2 (#139206)"
This reverts commit c93e57efac091f246b599b4fcdc189ed94753b43.

Reverted https://github.com/pytorch/pytorch/pull/139206 on behalf of https://github.com/atalman due to Will revert and reland skipping xpu builds ([comment](https://github.com/pytorch/pytorch/pull/139206#issuecomment-2494437857))
2024-11-22 18:01:18 +00:00
6d779d0549 Always unspecialize float in OSS (#138922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-22 17:54:42 +00:00
2239d1a7a3 Revert "[CI, 3.13] enable 3.13 CI (#139533)"
This reverts commit b7a25c1ee7cdb559516db2b10279c996742a1708.

Reverted https://github.com/pytorch/pytorch/pull/139533 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing test_cpp_extensions_open_device_registration. The test was wrongly excluded by TD ([comment](https://github.com/pytorch/pytorch/pull/139533#issuecomment-2494328806))
2024-11-22 17:18:49 +00:00
cf1d95a965 Revert "Add option to split Linear gates for Quantizable LSTM into separate ops (#140868)"
This reverts commit 3fcf66f61fbc8f760fc0d34356a60b76c3f2e27c.

Reverted https://github.com/pytorch/pytorch/pull/140868 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think lint is failing on this in trunk ([comment](https://github.com/pytorch/pytorch/pull/140868#issuecomment-2494076202))
2024-11-22 15:54:05 +00:00
080f992d68 Revert "[CI] Reduce distributed test timeout to 60s (#141168)"
This reverts commit e8de8f3969bf935442378efd125442de90e78431.

Reverted https://github.com/pytorch/pytorch/pull/141168 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think we missed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/141168#issuecomment-2494060624))
2024-11-22 15:46:37 +00:00
f23621ec56 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit c25b201583fc28243b87c460a2f18e2531a676e7.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978))
2024-11-22 15:43:39 +00:00
cc90ba8924 Revert "[sparse] add extra options to _cslt_spare_mm (#137427)"
This reverts commit 45b30a5aecf31ec26d9b2dc86d5170f9618a7766.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_sparse_semi_structured is failing in trunk after it lands ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2494047577))
2024-11-22 15:40:21 +00:00
7a9d0e3c06 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-11-22 15:07:46 +00:00
c93e57efac [triton] Update pin for PyTorch 2.6/Triton 3.2 (#139206)
Bump the Triton pin to the release candidate commit for Triton 3.2.

A few changes beyond the pin bump itself are needed:
* Remove the script that adds a git version hash suffix to the Triton wheel, since as of https://github.com/triton-lang/triton/pull/4812 Triton adds that itself
* Add `pybind11` to the Triton build setup, since Triton now depends on it
* Use manylinux-2.28 for the Triton wheel builder, and use clang+lld for building to pick up the right glibc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139206
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-11-22 14:50:22 +00:00
b7a25c1ee7 [CI, 3.13] enable 3.13 CI (#139533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139533
Approved by: https://github.com/atalman, https://github.com/malfet
2024-11-22 14:43:02 +00:00
e0d97e936a OpenReg: Fix releasing tensor issue when exiting process (#140936)
When executing the following code:

```
import pytorch_openreg

import torch

if __name__ == "__main__":
    a = torch.tensor(1, device="openreg")

```
Sometimes releasing tensor `a` fails after the process finishes executing the `main` function. The trace of releasing `a` is `~Tensor()` -> ... -> `OpenRegMem.cpp` -> `OpenRegHooks.cpp` -> `_aten_impl.py`.

There are two failure scenarios I've found:

1. Segmentation fault: Before executing `~Tensor()`, the process has already released global variables in `_aten_impl.py`, which causes the issue.
2. Waiting indefinitely: The main process passes the `free ptr` command to the daemon process; however, the daemon processes have already shut down.

The fix is to ignore the `del ptr` operation when the process is shutting down.
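
A minimal sketch of that guard (assumed shape; `send_to_daemon` is a hypothetical stand-in, not OpenReg's actual code): once the interpreter is finalizing, the free request is simply dropped.

```python
import sys

def send_to_daemon(command: str, ptr: int) -> None:
    # Hypothetical stand-in for OpenReg's IPC to its allocator daemon.
    ...

def free_ptr(ptr: int) -> None:
    if sys.is_finalizing():
        # Process is shutting down: globals and daemon workers may already
        # be gone, so ignore the free instead of crashing or hanging.
        return
    send_to_daemon("free", ptr)
```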

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140936
Approved by: https://github.com/ezyang
2024-11-22 13:50:35 +00:00
4009d15412 Optimize hook description of register_module_forward_hook (#140379)
Fixes #74024

Optimize description as the issue suggested

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140379
Approved by: https://github.com/mikaylagawarecki
2024-11-22 13:40:45 +00:00
1af69eee4a Solid XPU UT test_memory_allocation (#141325)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/141326

# Additional Context
We use the previous value queried by these APIs as the reference value rather than 0. With this PR, we don't depend on the Python garbage collection mechanism anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141325
Approved by: https://github.com/EikanWang
2024-11-22 13:14:49 +00:00
f497a0039c API to retrieve default distributed backend from device (#140536)
# Motivation
The distributed APIs rely on backend names for the creation of process groups.
To abstract references to these names out of PG creation, an API is added to get the default distributed backend for a device.
The device code would need to register its device and backend via `torch.distributed.Backend.register_backend`, or update the map `torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend"`, prior to using the API.

An example of use is added in the test file (which can be used to check the abstracted APIs).
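
A hedged sketch of the registration step mentioned above (the device and backend names below are made up for illustration; the new lookup API itself is not named in this description, so it is not shown):

```python
import torch.distributed as dist

# Option 1: register a custom backend (creator function omitted here);
# see torch.distributed.Backend.register_backend for the full signature.
# dist.Backend.register_backend("my_backend", my_backend_creator)

# Option 2: map an existing device type to its default backend directly.
dist.Backend.default_device_backend_map["privateuse1"] = "my_backend"
```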

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140536
Approved by: https://github.com/kwen2501
2024-11-22 11:01:53 +00:00
7d89a8d385 Add ExportedProgram type annotation (#141247)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141247
Approved by: https://github.com/Skylion007
2024-11-22 10:40:42 +00:00
a6344c8bcd Throw an error if args contain reserved python keywords (#135357)
This PR adds a check for reserved Python keywords in the `error_check_native_functions` function in `torchgen/gen.py`.

Fixes #135127
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135357
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-22 07:44:50 +00:00
bd971cc395 safer check for isatty in fx/_utils.py (#140876)
if no isatty method is defined, it's probably not a tty
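
A plausible shape of the safer check (an assumption for illustration, not necessarily the exact code in the PR):

```python
def _is_tty(stream) -> bool:
    # Streams without an isatty method (or where it fails) are treated as non-tty.
    isatty = getattr(stream, "isatty", None)
    try:
        return bool(isatty()) if callable(isatty) else False
    except (OSError, ValueError):
        return False
```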

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140876
Approved by: https://github.com/ezyang
2024-11-22 07:27:28 +00:00
cyy
1bdb92cbff [2/N] Use thread-safe strerror (#141011)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141011
Approved by: https://github.com/ezyang
2024-11-22 07:02:30 +00:00
8b13ed594a Add skip_first_wait to profiler.schedule (#141070)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/91888

We use `wait` as the number of steps to wait in between cycles when profiling, and `skip_first` to delay the start of said profiling. However, once the `skip_first` steps are completed, we immediately go to the wait phase. This is not problematic if `wait` is smaller than `skip_first`, because we can just lower the value of `skip_first`, but if it is larger then we end up starting the first profile much later than desired. For example, imagine `skip_first=1` and `wait=100` with `repeat=2`. We do want to wait 100 steps in between cycles 1 and 2, but we may not want to start the warmup of cycle 1 at step 101 (which is forced, because the wait occurs directly after the first steps are skipped). This diff addresses this by adding a flag to skip the first wait.

Adds a new flag but sets it to false by default so that the existing implementation is not affected.

Test Plan:
Got reasonable traces with this schedule:

schedule=torch.profiler.schedule(
            wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
        )

Differential Revision: D66198138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141070
Approved by: https://github.com/aaronenyeshi, https://github.com/briancoutinho
2024-11-22 06:40:58 +00:00
a3e516d165 [aoti] Split custom ops tests (#140977)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140977
Approved by: https://github.com/desertfire
2024-11-22 06:18:25 +00:00
3acc6eac49 [inductor] Add typing to ir.py 2 (#140915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140915
Approved by: https://github.com/aorenste
2024-11-22 04:56:54 +00:00
cyy
35ecca735e [2/N] Replace at::detail::Array with std::array (#141205)
Follows #122064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141205
Approved by: https://github.com/ezyang
2024-11-22 04:44:40 +00:00
3fcf66f61f Add option to split Linear gates for Quantizable LSTM into separate ops (#140868)
Summary:
For LSTM, the input and hidden state are projected with Linear layers to construct the 4 gates. This is typically performed together as a single Linear (for each state) with output channel count `4 * hidden_dim` for efficiency.
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=52-58
The output is then ultimately split into 4:
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=83-87

For on-device latency (and possibly memory) considerations, we want to avoid constructing the intermediate `gates` tensor (which can be relatively large), by splitting `igates` and `hgates` first (as 4x `Linear(hidden_dim, hidden_dim)` each), applying add separately, then proceeding as usual.

This functionality can be enabled by specifying `split_gates=True` (the default, `False`, preserves the original behavior) at any entry point (directly with `torch.ao.nn.quantizable.LSTM` or via `_get_lstm_with_individually_observed_parts`).
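
A short usage sketch, assuming the flag is exposed on the module constructor as the description states (sizes below are illustrative):

```python
from torch.ao.nn.quantizable import LSTM

# Original behavior: one Linear per state producing 4 * hidden_dim channels.
lstm_fused = LSTM(input_size=32, hidden_size=64, num_layers=1)

# New option: 4 separate Linear(hidden_dim, hidden_dim) projections per state,
# avoiding the large intermediate `gates` tensor.
lstm_split = LSTM(input_size=32, hidden_size=64, num_layers=1, split_gates=True)
```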

Test Plan:
piggy back on existing test to check for correct swap handling, numerics, and jit.script during prepare/convert
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_custom_module_lstm (caffe2.test.quantization.core.test_quantized_op.TestQuantizedOps)'
```
https://www.internalfb.com/intern/testinfra/testrun/11540474102848372

This test is quite long-running now (more than double the original runtime).

Reviewed By: Ninja91

Differential Revision: D65283170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140868
Approved by: https://github.com/jerryzh168
2024-11-22 04:10:26 +00:00
150ffb6e07 [flight recorder] Updated MatchState to have a member variable (#141297)
Summary: Without this change, calling `str(MatchState.SOMETHING)` will raise an exception.

Test Plan:
Can we add unittest somewhere?
Ensure `str(MatchState.FULLY_MATCHED)` and `str(MatchState.FULLY_MATCHED())` won't raise an exception.

Differential Revision: D66321609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141297
Approved by: https://github.com/fduwjj
2024-11-22 03:14:34 +00:00
3bec67b8e5 Fix tests in test/test_serialization that were failing if run individually (#141300)
#140739 and #140740 made it such that `get_safe_globals` no longer returns an empty list by default.

This caused some tests that check the content of `get_safe_globals` to fail when run individually (they didn't fail in the full test suite because other tests that ran before them called `clear_safe_globals`) [T208186010](https://www.internalfb.com/intern/tasks/?t=208186010)

test_safe_globals_for_weights_only
test_safe_globals_context_manager_weights_only

This PR fixes that and also makes most tests that call `clear_safe_globals` use the `safe_globals` context manager rather than try/finally.
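
A hedged sketch of the context-manager pattern the tests were moved to (assuming the `torch.serialization.safe_globals` / `get_safe_globals` APIs referenced above; the class name is made up):

```python
import torch

class MyCheckpointClass:
    pass

# Scope the allowlist instead of pairing add_safe_globals() with a
# try/finally that calls clear_safe_globals().
with torch.serialization.safe_globals([MyCheckpointClass]):
    assert MyCheckpointClass in torch.serialization.get_safe_globals()
# Outside the block the temporary entry is removed again, leaving whatever
# defaults get_safe_globals() now returns.
```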

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141300
Approved by: https://github.com/awgu
2024-11-22 02:40:37 +00:00
dbe6fce185 [CUDA][Nightly Binary] Remove PTX from cuda 12.4 Nightly (#141142)
Separate cuda 12.4 | 12.6 logic
Remove PTX from cuda 12.4
Remove deprecated cuda 11.[6/7]

Discussed in https://github.com/pytorch/pytorch/issues/137374#issuecomment-2489200733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141142
Approved by: https://github.com/atalman
2024-11-22 02:34:59 +00:00
c25b201583 Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-22 02:04:36 +00:00
c83b739f14 Migrate pull jobs cuda12.1->cuda12.4 (#141271)
Cuda 12.1 nightly builds where deprecated. Hence no reason on keep testing cuda 12.1 in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141271
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
2024-11-22 01:52:38 +00:00
f28bac76f5 [AOTI Minifier] Save EP instead of graphs (#141159)
Summary:
`repro.py` can have nested graph modules, e.g.

```
class Repro(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.true_graph_0 = GraphModule()

    def forward(self):
        true_graph_0 = self.true_graph_0
        return (true_graph_0,)
```

So dumping the string doesn’t always work.

So,
1) we use exported program in repro.py instead
2) we still dump the graph module string, but only put it in comments

We also added two flags to `minifier_launcher.py`
- `minifier-export-mode`: whether strict or non-strict export is used in the minifier
- `skip-export-error`: intermediate graphs that cannot be exported will be skipped.

Test Plan:
```
buck2 run  fbcode//caffe2/test/inductor:minifier_utils_cpu  -- -r string
python test/inductor/test_minifier.py
```

Differential Revision: D66175257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141159
Approved by: https://github.com/henrylhtsang
2024-11-22 01:51:10 +00:00
ca9813ea14 Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case (#140258)
As suggested by @leslie-fang-intel in 4c83e4e751 (diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537), all elements of the `B` tiles (not AMX tiles, but the tiles at the granularity of the micro-kernel) are contiguous since the `B` matrix is pre-packed, so the dequantized-buffer loading logic can be simplified. While the previous approach kept the elements to be loaded into a `B` AMX tile contiguous, the new approach does not entail any performance penalty either, because that data is already in L1D, so loading AMX tiles from non-contiguous dequantized `B` elements does not adversely affect performance.

Also rectified the size of the dequantized B buffer.

Fixes #140208.

A subsequent PR will factor out caching of dequantized int8 weights into a separate codegen function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140258
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-11-22 01:34:06 +00:00
f5d00f1456 pytorch/features: Make a feature logger and record triton bundling (#141056)
This modifies metrics_context to allow us to store whether a feature was
used or not.

This also starts recording this for triton bundling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141056
Approved by: https://github.com/masnesral
2024-11-22 01:31:08 +00:00
0155a112fd [export] avoid name collision when inlining node (#141169)
Summary:
When we have both `set_grad` and `autocast` HOP, name collision might happen when we try to inline a node.

For exmaple, for a GraphModule like this:

```
GraphModule(
  (submod_0): GraphModule(
    (submod_1): GraphModule()
  )
  (submod_1): GraphModule()
  (submod_2): GraphModule()
)

```

when we inline `submod_0`, we might accidentally overwrite `submod_1`.

In this PR, we fix this by checking whether the graph module already has an attribute with the same name; if so, we use the next `submod_{i}` until there is no name collision.
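
A minimal sketch of that renaming loop (an assumed helper, not the actual export code):

```python
import torch.fx

def next_free_submod_name(gm: torch.fx.GraphModule, start: int = 0) -> str:
    # Bump the index until the target module has no attribute with that name.
    i = start
    while hasattr(gm, f"submod_{i}"):
        i += 1
    return f"submod_{i}"
```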

Partially fixes https://github.com/pytorch/pytorch/issues/140589.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_predispatch_autocast_and_set_grad
```

Differential Revision: D66200994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141169
Approved by: https://github.com/angelayi
2024-11-22 01:08:22 +00:00
d8b4406e12 [MPS] Expand fused forloop to bfloat16 (#141104)
For MacOS14+

Running following script (adapted from one mentioned in https://github.com/pytorch/pytorch/pull/127242 )
```python
import torch
from torch.optim import adam, adamw
import torch.utils.benchmark as benchmark
import itertools

def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
    fn(
        params,
        grads,
        exp_avgs,
        exp_avg_sqs,
        max_exp_avg_sqs,
        state_steps,
        foreach=False,
        capturable=False,
        fused=fused,
        amsgrad=amsgrad,
        beta1=0.9,
        beta2=0.99,
        lr=1e-3,
        weight_decay=.0,
        eps=1e-5,
        maximize=False,
        grad_scale=None,
        found_inf=None,
    )
    torch.mps.synchronize()

device, dtype = "mps", torch.bfloat16

results = []

for num_tensors, numel, adamWflag, amsgrad in itertools.product([10, 50, 100], [1024, 65536, 1048576], [True, False], [True, False]):
    print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
    params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=dtype, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
    max_exp_avg_sqs = [torch.arange(numel, dtype=dtype, device=device) for _ in range(num_tensors)] if amsgrad else []
    state_steps = [torch.tensor([5], dtype=dtype, device=device) for _ in range(num_tensors)]
    fn = adamw.adamw if adamWflag else adam.adam

    for fused in [True, False]:

        t = benchmark.Timer(
                stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                label=f'Fused Adam on {device} using {dtype}',
                sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                globals=locals(),
                description= f"Fused: {fused}",
            ).blocked_autorange(min_run_time=5)
        results.append(t)

compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.colorize(rowwise=True)
compare.print()
```

Produces following results on M4Pro running MacOS 15
```
[-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------]
                                                                          |  Fused: True  |  Fused: False
1 threads: ----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10        |       283     |      2810
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10       |       277     |      2430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10       |       285     |      2400
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10      |       278     |      2250
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10       |       504     |      2700
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10      |       478     |      2600
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10      |       506     |      2500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10     |       482     |      2300
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10     |      2089     |      4190
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10    |      1940     |      3800
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10    |      2100     |      3770
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10   |      1950     |      3600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50        |       842     |     14000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50       |       835     |     11800
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50       |       845     |     11700
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50      |       855     |     11000
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50       |      1410     |     14000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50      |      1350     |     12000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50      |      1400     |     12000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50     |      1340     |     11000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50     |      9767     |     20400
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50    |      8991     |     18600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50    |      9803     |     18300
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50   |      9070     |     17600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100       |      1600     |     27000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100      |      1600     |     24100
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100      |      1600     |     23500
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100     |      1600     |     21800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100      |      2740     |     26000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100     |      2580     |     24000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100     |      2730     |     25000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100    |      2600     |     23000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100    |     19350     |     39000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100   |     17780     |     37300
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100   |     19400     |     37000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100  |     17900     |     35500
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141104
Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007
ghstack dependencies: #141089, #141090, #141092, #141103
2024-11-22 01:07:15 +00:00
989888236e Revert "[MPS] Expand fused forloop to bfloat16 (#141104)"
This reverts commit 9a729390420570cd2528ce2e9947e3eab209660b.

Reverted https://github.com/pytorch/pytorch/pull/141104 on behalf of https://github.com/malfet due to Want to add test script to the commit message ([comment](https://github.com/pytorch/pytorch/pull/141104#issuecomment-2492659931))
2024-11-22 01:03:43 +00:00
e8de8f3969 [CI] Reduce distributed test timeout to 60s (#141168)
Pulling a PR to test viability.
Today's timeout is 300s, which could waste quite some machine time if a hang happens in CI.

Differential Revision: [D66275756](https://our.internmc.facebook.com/intern/diff/D66275756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141168
Approved by: https://github.com/clee2000
2024-11-22 00:59:55 +00:00
65166d86a3 [MPS] Add regression test for sync deadlock (#141296)
See https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492434870
Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141296
Approved by: https://github.com/huydhn
2024-11-22 00:56:33 +00:00
25c0b91dbb [Docs] Make links to source link to source (#141186)
Rewrite [SOURCE] links in the API docs to point to the source file in github repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141186
Approved by: https://github.com/malfet, https://github.com/msaroufim

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-22 00:50:19 +00:00
f708e92ba1 [Inductor] support Conv/Linear + broadcast add fusion (#138201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138201
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-22 00:47:25 +00:00
5ab5a61671 Revert "[ROCm][CI] upgrade CI to ROCm 6.2.4 (#140851)"
This reverts commit 6c9bfd52b6a76ddff053bcff4d23ea7f4c280e9a.

Reverted https://github.com/pytorch/pytorch/pull/140851 on behalf of https://github.com/jithunnair-amd due to Need to upgrade libtorch images to ROCm 6.2.4 as well ([comment](https://github.com/pytorch/pytorch/pull/140851#issuecomment-2492641342))
2024-11-22 00:44:34 +00:00
612122af8f Fix type-safety of torch.nn.Module instances (#141240)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141240
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-22 00:05:05 +00:00
f869a0ffe1 Fix the use of fsspec transactions (#135541)
fsspec transactions do not support concurrency and assume that there is at most one running transaction per filesystem. This is *not* true in our usage, where, because of multi-threading, we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using `fsspec`'s commit/rollback behavior because the internal PathManagerFileSystem used for `iopath` does not properly support that behavior.
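
Below is a minimal, hypothetical sketch (not the actual code in this PR) of the "old-fashioned" commit/discard pattern described above: stage the write in a temporary file, promote it atomically on success, and delete it on failure.
```python
import os
import tempfile

def write_committed(path: str, data: bytes) -> None:
    # Stage the write in a temporary file next to the destination.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # commit: atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)         # discard: drop the partially written file
        raise
```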

I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
2024-11-22 00:03:19 +00:00
e894219504 [export] fix loss_output in joint graph signature (#140974)
Summary: joint-graph export was marking all outputs as LOSS_OUTPUT; fix it so that only the correct one is marked.

Test Plan: test_experimental

Differential Revision: D66117412

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140974
Approved by: https://github.com/JacobSzwejbka
2024-11-21 23:57:07 +00:00
f044c1a7c8 Fixes #140986, improves wording and grammar of nn/module.py (#140987)
Fixes #140986

This includes several improvements to the grammar and wording of nn/module.py, mostly simple one-word fixes, but also other slightly more elaborate ones.

It addresses about half of the docs for module.py but I would be glad to cover the rest of it if required to do so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140987
Approved by: https://github.com/mikaylagawarecki
2024-11-21 23:40:43 +00:00
45b30a5aec [sparse] add extra options to _cslt_sparse_mm (#137427)
Summary:

Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.

This PR adds in the additional cuSPARSELt bindings into pytorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8
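
A hedged usage sketch of the new accessor listed above (availability depends on a cuSPARSELt-enabled CUDA build):
```python
import torch

if torch.backends.cusparselt.is_available():
    print(torch.backends.cusparselt.version())
    # New in this PR: query the maximum algorithm id exposed by cuSPARSELt.
    print(torch.backends.cusparselt.get_max_alg_id())
```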

Test Plan:

python test/test_sparse_semi_structured.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-21 23:37:36 +00:00
9a72939042 [MPS] Expand fused forloop to bfloat16 (#141104)
For macOS 14+.

Running the following script
```python
```

Produces the following results on an M4 Pro running macOS 15
```
[-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------]
                                                                          |  Fused: True  |  Fused: False
1 threads: ----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10        |       283     |      2810
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10       |       277     |      2430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10       |       285     |      2400
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10      |       278     |      2250
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10       |       504     |      2700
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10      |       478     |      2600
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10      |       506     |      2500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10     |       482     |      2300
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10     |      2089     |      4190
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10    |      1940     |      3800
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10    |      2100     |      3770
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10   |      1950     |      3600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50        |       842     |     14000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50       |       835     |     11800
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50       |       845     |     11700
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50      |       855     |     11000
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50       |      1410     |     14000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50      |      1350     |     12000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50      |      1400     |     12000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50     |      1340     |     11000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50     |      9767     |     20400
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50    |      8991     |     18600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50    |      9803     |     18300
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50   |      9070     |     17600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100       |      1600     |     27000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100      |      1600     |     24100
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100      |      1600     |     23500
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100     |      1600     |     21800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100      |      2740     |     26000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100     |      2580     |     24000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100     |      2730     |     25000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100    |      2600     |     23000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100    |     19350     |     39000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100   |     17780     |     37300
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100   |     19400     |     37000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100  |     17900     |     35500
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141104
Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007
ghstack dependencies: #141089, #141090, #141092, #141103
2024-11-21 23:30:37 +00:00
740d1eb030 Fix test_out when run on CPU with CUDA available (#137140)
Ever since #135140, this test will fail if run with CPU parameterization (e.g. test_out__refs_logical_or_cpu_float32) and CUDA available - as far as I can tell, the PyTorch CI isn't currently checking for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137140
Approved by: https://github.com/ezyang
2024-11-21 23:10:07 +00:00
37fe8015ac softshrink nan fixes (#138421)
Fixes #138385.

Currently contains fixes for CPU and CUDA. Will add fixes for MPS as well soon if my Mac can build it from source. (Had some issues building it on my Linux PC due to limited memory.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138421
Approved by: https://github.com/mikaylagawarecki
2024-11-21 23:06:08 +00:00
3b84fb26d0 Enable inductor-rocm workflow for all trunk commits AND inductor-related PRs (#138623)
It should help with triaging ROCm-inductor-related breakages and surfacing them in the PRs themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138623
Approved by: https://github.com/huydhn
2024-11-21 22:51:49 +00:00
ba5c4a727f Upload sccache stats into benchmark database with build step time (#140839)
Guinea pig benchmark database
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140839
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-21 22:38:45 +00:00
7b2138b864 [inductor] fix uncaught exception when checking for openmp on macos (#141208)
Based on #133776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141208
Approved by: https://github.com/Skylion007
2024-11-21 22:17:52 +00:00
e908f9278f [ONNX] Remove test_save_with_without_initializer test (#141263)
The test is flaky and obsolete. So remove.

Fixes https://github.com/pytorch/pytorch/issues/125020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141263
Approved by: https://github.com/titaiwangms
2024-11-21 22:06:15 +00:00
e28b09517f [miniz] Make sure miniz extra_size_remaining doesn't go off bound (#141266)
#140041 added some logic to fix a zip64 header error. This PR makes sure `extra_size_remaining` doesn't overflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141266
Approved by: https://github.com/angelayi
2024-11-21 22:02:28 +00:00
5e54cf3687 Revert "Fix MPS synchronize by waiting for root buffer to complete (#140725)"
This reverts commit 9bc9d4cdb4355a385a7d7959f07d04d1648d6904.

Reverted https://github.com/pytorch/pytorch/pull/140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from  https://github.com/pytorch/pytorch/pull/127242 ([comment](https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492416501))
2024-11-21 21:56:22 +00:00
cc36d039d4 [FlexAttention] Rename zeros_and_scatter library (#141185)
# Summary
The previous custom op library name was a little verbose and didn't really align with how we typically name our libraries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141185
Approved by: https://github.com/Chillee
ghstack dependencies: #141164
2024-11-21 21:35:48 +00:00
073cbf2c9d [FlexAttention] Fix another IMA with captured buffers (#141164)
# Summary
We have another IMA (illegal memory access) for captured buffers when the sequences are not divisible.

Running test before this commit:
```Shell
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 447 errors
========= ERROR SUMMARY: 347 errors were not printed. Use --print-limit option to adjust the number of printed errors
```

And After
```Shell
❯ CUDA_LAUNCH_BLOCKING=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck pytest test/inductor/test_flex_attention.py -k "test_non_divisible_with_captured_buffer"
========= COMPUTE-SANITIZER
====================================================== test session starts =======================================================
platform linux -- Python 3.12.7, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/drisspg/meta/pytorch
configfile: pytest.ini
plugins: hypothesis-6.115.5, typeguard-4.3.0
collected 518 items / 517 deselected / 1 selected
Running 1 items in this shard

test/inductor/test_flex_attention.py .                                                                                     [100%]

=============================================== 1 passed, 517 deselected in 13.31s ===============================================
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141164
Approved by: https://github.com/Chillee
2024-11-21 21:35:48 +00:00
a0e84ff5c6 [inductor] Check Triton Autotuner.__init__ for pre_hook/post_hook (#141040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141040
Approved by: https://github.com/aakhundov
ghstack dependencies: #140982
2024-11-21 21:30:01 +00:00
fa63276691 [user empathy day][dynamo] Support get on subclassed dicts (#141214)
Fixes https://github.com/pytorch/pytorch/issues/141138, but we need to do a more exhaustive job of going through dict methods and checking each one of them.
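
A small illustration (class and variable names are illustrative) of the pattern this enables — calling `.get` on a `dict` subclass inside a compiled function without a graph break:
```python
import torch

class Config(dict):
    pass

cfg = Config(scale=2.0)

@torch.compile
def f(x):
    # .get on a dict subclass is now traced instead of falling back to eager.
    return x * cfg.get("scale", 1.0)

print(f(torch.ones(3)))  # tensor([2., 2., 2.])
```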

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141214
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #141209
2024-11-21 21:18:42 +00:00
d7402cd196 [user-empathy-day][dynamo] Remove special casing for torch.nn.Parameter tracing (#141209)
This was done to reduce compile time earlier, but I have seen two cases in the past month where this code falters, one from the user empathy day -
https://docs.google.com/document/d/1nEX1GtKhNzid6NvNg5CaVamO6JrJoKPuJ2iueWUYFWc/edit?tab=t.0

So removing this code. It can affect compile time for a few models by a few seconds, but it's way less code to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141209
Approved by: https://github.com/jansel
2024-11-21 21:18:42 +00:00
6c9bfd52b6 [ROCm][CI] upgrade CI to ROCm 6.2.4 (#140851)
Fixes issue of long docker build times in PRs which trigger the docker build in regular PyTorch build jobs eg. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm6.2 because:
1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which [docker-build workflows](924c1fe3f3/.github/workflows/docker-builds.yml (L50)) run (`c5.12xlarge`)
2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into [timeout of 90mins](9abd4d95bb/.github/actions/calculate-docker-image/action.yml (L171)): https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140851
Approved by: https://github.com/jeffdaily
2024-11-21 21:12:48 +00:00
04f569a524 [ROCm] AMDSMI memory usage unification (#139900)
Fixes https://github.com/pytorch/pytorch/issues/140638

The old implementation used vram_used, which is not the correct equivalent of the pynvml memory-utilization API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139900
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-11-21 21:11:39 +00:00
614e727191 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit cd942d00dde73dbf9d7c5f89fdd7152f3440c4ca.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/izaitsevfb due to causes crash internally during test listing ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2492328790))
2024-11-21 21:05:22 +00:00
6ba5fa47ea Add reference to pad_packed_sequence in pack_padded_sequence doc (#137294)
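
For context, the two APIs the doc now cross-references are inverses of each other; a quick illustration:
```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

x = torch.zeros(3, 5, 7)                # (batch, max_len, features), already padded
lengths = torch.tensor([5, 3, 2])
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(unpacked.shape, out_lengths)      # torch.Size([3, 5, 7]) tensor([5, 3, 2])
```
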
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137294
Approved by: https://github.com/mikaylagawarecki
2024-11-21 21:01:17 +00:00
4e34fbdcbc Add inductor_fx_graph_cache stats to dynamo_utils (#141190)
Summary:
Add the following inductor fx graph cache stats to dynamo compile

- inductor_fx_cache_hit_count
- inductor_fx_cache_miss_count
- inductor_fx_cache_backend_type
- inductor_fx_cache_hit_keys
- inductor_fx_cache_miss_keys
- remote_cache_version

Test Plan: Run local tests and staging logger: P1683061460

Differential Revision: D66232206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141190
Approved by: https://github.com/masnesral
2024-11-21 20:59:10 +00:00
149677e30c Revert "[dynamo] Added cuda and triton versions to dynamo_compile" (#141280)
Reverts pytorch/pytorch#141140

reason: conflicts with https://github.com/pytorch/pytorch/pull/141190 and wasn't merged using mergebot

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141280
Approved by: https://github.com/clee2000, https://github.com/kit1980
2024-11-21 20:50:06 +00:00
11d0ba068f [dynamo] Added cuda and triton versions to dynamo_compile (#141140)
[dynamo] Added cuda and triton versions to dynamo_compile (#141140)

Summary:

Add cuda and triton versions to dynamo_compile logging site.

Test Plan:
$ buck2 run mode/opt //scripts/oulgen:runner
File changed: fbcode//caffe2/torch/_dynamo/convert_frame.py
Buck UI: https://www.internalfb.com/buck2/1a8ada1f-d54e-44b2-a368-b2ff2030e113
Network: Up: 65KiB  Down: 0B  (reSessionID-8f4d1d6d-a680-4ecc-8e73-c29c932d824b)
Jobs completed: 2166. Time elapsed: 7.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
BUILD SUCCEEDED
...
Cuda: 12.4.0
Triton: 3.0.0

Reviewed By: masnesral

Differential Revision: D66181508
2024-11-21 12:20:02 -08:00
4fb4aa3e70 Updated docstrings referring to torch.expand to point to torch.Tensor.expand (#140045)
`torch.expand` was moved to `torch.Tensor.expand` but some docstrings still refer to `torch.expand`
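
A quick illustration of the method the docstrings now point to:
```python
import torch

x = torch.tensor([[1], [2], [3]])
y = x.expand(3, 4)   # views the size-1 dim as size 4 without copying data
print(y.shape)       # torch.Size([3, 4])
```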

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140045
Approved by: https://github.com/mikaylagawarecki
2024-11-21 20:13:41 +00:00
d3c8f1af8d Revert "[export] serialize sympy.Exprs as ASTs instead of strings (#140084)"
This reverts commit d869344bc00bf7de815a2b69fb0909e7229bc5bf.

Reverted https://github.com/pytorch/pytorch/pull/140084 on behalf of https://github.com/izaitsevfb due to reverted internally in D66253238 ([comment](https://github.com/pytorch/pytorch/pull/140084#issuecomment-2492165667))
2024-11-21 20:09:54 +00:00
da94ab0b66 [inductor] Add typing to ir.py 1 (#140912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140912
Approved by: https://github.com/aorenste
ghstack dependencies: #140895, #140910
2024-11-21 20:01:57 +00:00
6eca0aee76 [inductor] Refactor ir.Layout into ir.OutputSpec (#140910)
This separate the concepts of a Layout (size/stride/etc) and an OutputSpec (which includes multiple outputs).  Which should make typing easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910
Approved by: https://github.com/ezyang
ghstack dependencies: #140895
2024-11-21 20:01:57 +00:00
827f2f749e [CUTLASS] Raise NotImplementedError if X & W aren't FixedLayout (#140985)
Summary: title

Differential Revision: D66131402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140985
Approved by: https://github.com/Skylion007
2024-11-21 19:59:19 +00:00
a847790400 [inductor] reset to zero support for user defined Triton kernels (#140982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140982
Approved by: https://github.com/aakhundov
2024-11-21 18:53:23 +00:00
723498aab8 Gaussian nll loss scalar variance support (#138931)
Fixes #138747

Adds support for `variance` being a Tensor or a float in `gaussian_nll_loss`, to avoid a CPU-GPU sync point in the loss function when the variance is a static tensor like `<scalar>*torch.ones_like(input)`
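
A short sketch of the change described above (the float-variance form is the newly added one):
```python
import torch
import torch.nn.functional as F

input = torch.randn(8, 5)
target = torch.randn(8, 5)

# Previously required materializing a full variance tensor:
loss_old = F.gaussian_nll_loss(input, target, torch.full_like(input, 0.5))
# Now a plain Python float works, avoiding the extra tensor (and a potential CPU-GPU sync):
loss_new = F.gaussian_nll_loss(input, target, 0.5)
```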

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138931
Approved by: https://github.com/mikaylagawarecki
2024-11-21 18:20:09 +00:00
e39955e82f Avoid some max constructor optimizations when known not needed. (#139741)
Summary:
Around 10% improvement with 1K nodes; more than that with 2K features: 414.5735 -> 333 (20%).

This targets optimizing patterns like this:
```
 sym_max: "Sym(Max(u31 + u32, u33 + u34))" = torch.sym_max(sym_sum_6, sym_sum_7);  sym_sum_6 = sym_sum_7 = None
        sym_max_1: "Sym(Max(u31 + u32, u33 + u34, u35 + u36))" = torch.sym_max(sym_max, sym_sum_8);  sym_max = sym_sum_8 = None
        sym_max_2: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38))" = torch.sym_max(sym_max_1, sym_sum_9);  sym_max_1 = sym_sum_9 = None
        sym_max_3: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40))" = torch.sym_max(sym_max_2, sym_sum_10);  sym_max_2 = sym_sum_10 = None
        sym_max_4: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42))" = torch.sym_max(sym_max_3, sym_sum_11);  sym_max_3 = sym_sum_11 = None
        sym_max_5: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44))" = torch.sym_max(sym_max_4, sym_sum_12);  sym_max_4 = sym_sum_12 = None
        sym_max_6: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44, u45 + u46))" = torch.sym_max(sym_max_5, sym_sum_13);  sym_max_5 = sym_sum_13 = None
        sym_max_7: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44, u45 + u46, u47 + u48))" = torch.sym_max(sym_max_6, sym_sum_14);  sym_max_6 = sym_sum_14 = None
        sym_max_8: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44, u45 + u46, u47 + u48, u49 + u50))" = torch.sym_max(sym_max_7, sym_sum_15);  sym_max_7 = sym_sum_15 = sym_max_8 = None
```

<img width="496" alt="Screenshot 2024-11-05 at 11 00 35 AM" src="https://github.com/user-attachments/assets/455c06a3-e1bf-43cb-b880-9470ae6fb07f">
<img width="511" alt="Screenshot 2024-11-05 at 11 00 57 AM" src="https://github.com/user-attachments/assets/ff0d4236-9b5c-4a9a-8520-47b005bb3cb0">

Differential Revision: D65354971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139741
Approved by: https://github.com/ezyang
2024-11-21 16:50:52 +00:00
75bbad4768 Unbreak CUDA 11.4 build of Half.h (#141173)
`__CUDACC__` is needed to detect CUDA builds on that platform.

Differential Revision: [D66262774](https://our.internmc.facebook.com/intern/diff/D66262774/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66262774/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141173
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-21 16:36:38 +00:00
8e359a65f3 [ONNX] Use IRv10 (#141207)
Update to use IRv10 to support INT4 types and ValueInfo in functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141207
Approved by: https://github.com/titaiwangms
2024-11-21 16:34:35 +00:00
41f315417c Fix NJT linear_backward() memory usage (#141163)
Fixes #141112

The formula we're using for `linear_backward()` is inefficient for higher dim input sizes, even if the input is trivially higher dim (e.g. via use of `unsqueeze()`). This PR updates the formula to match the more efficient version employed by NST. Specifically, note the leading dim collapse for `grad_output`'s values before we compute the various matmuls.
d5ee1d1b58/aten/src/ATen/native/nested/NestedTensorBackward.cpp (L37-L70)

Testing for correctness is done via existing gradcheck tests (e.g. `test_backward_nn_functional_linear`). I added a memory usage test but I think it's likely there's a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141163
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-21 15:22:45 +00:00
f2f7ef9d59 Fix stride in TensorMetadata to always be a Tuple[int, ...] (#141106)
Test Plan: CI

Differential Revision: D66204410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141106
Approved by: https://github.com/Skylion007, https://github.com/evanleed
2024-11-21 14:52:36 +00:00
b25c291563 [C10D] Support group ranks in P2POp and batch_isend_irecv (#141054)
Changes the semantics of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious.

Also add mypy annotations

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141054
Approved by: https://github.com/kwen2501
2024-11-21 14:51:56 +00:00
3b67d4d687 [inductor] Don't clamp on split operation. (#141078)
This PR turns clamping off for the `split` operation. By doing so, we generate fewer bound guards and reduce the number of recompilations when varying the input size.

```python
@torch.compile(dynamic=True)
def f(x):
    return x.chunk(4)

>>> f(torch.arange(12))
(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([ 9, 10, 11]))

>>> f(torch.arange(11))

(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([ 9, 10]))

>>> f(torch.arange(10))
(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([9]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141078
Approved by: https://github.com/ezyang
ghstack dependencies: #141077
2024-11-21 13:53:38 +00:00
154f90f026 [inductor] Don't specialize split on sizes parameter. (#141077)
Fix: #139936

This PR modifies the lowering of the `split` operation so that it won't generate guards specializing on the sizes parameter. Instead, it specializes on the number of output tensors being generated (i.e. a function of the size of the base tensor and the sizes parameter).

As a result, operations such as `chunk` (whose number of output tensors usually is
constant given a static chunk number) won't trigger recompiles when varying the size of
the base tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141077
Approved by: https://github.com/ezyang
2024-11-21 13:53:38 +00:00
dcf7728fd6 Update submodule ideep for ideep conv changes (#141101)
Summary:
Update submodule ideep to include ideep conv changes: modify convolution_forward to support broadcast add fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141101
Approved by: https://github.com/Skylion007, https://github.com/jgong5
2024-11-21 12:26:24 +00:00
ecf3bae40a [dynamo] support operator.methodcaller (#141137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141137
Approved by: https://github.com/jansel
ghstack dependencies: #141122
2024-11-21 09:13:23 +00:00
1132b6764a [draft export] generate fake outputs when real tensor prop finds mismatches (#139766)
Currently real tensor tracing raises MetadataMismatchErrors if registered fake kernels don't match the real kernels (e.g. shape, aliasing, dtype, etc.). This adds an option to use fake kernel inference to bypass mismatches - this option defaults to False for real tensor tracing, but is on for draft export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139766
Approved by: https://github.com/angelayi, https://github.com/zou3519
2024-11-21 08:01:09 +00:00
66476617bf [Dist][CI] Easier override of destroy-upon-exit setting (#141192)
Adding `destroy_pg_upon_exit` property to allow derived Test classes to control whether auto destroy is desired.
(Otherwise, derived test classes will need to rewrite the `_run()` method, leading to duplicated code of `_run()` and if one needs to add things to `_run` in the future, more code change is needed.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
2024-11-21 07:32:56 +00:00
d65f194ab9 [dynamo] support operator.attrgetter and operator.itemgetter (#141122)
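
A small sketch of what this enables (function and variable names are illustrative) — using `operator.itemgetter` inside a compiled function without a graph break:
```python
import operator
import torch

get_hw = operator.itemgetter(-2, -1)

@torch.compile
def f(x):
    h, w = get_hw(x.shape)
    return x.reshape(-1, h * w)

print(f(torch.randn(2, 3, 4)).shape)  # torch.Size([2, 12])
```
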
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141122
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-11-21 06:48:33 +00:00
fb529c2c84 [dynamo] skip_guard_eval_unsafe stance for power users (#140251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140251
Approved by: https://github.com/jansel
ghstack dependencies: #140223, #140250
2024-11-21 06:28:58 +00:00
7392e88219 Instead of using node.meta read from side table directly (#141146)
When a transformation phase copies/modifies a node, it might drop node.meta (the same goes for graph.meta), so they are not good storage locations. Instead, read directly from the side table.

Differential Revision: [D66249968](https://our.internmc.facebook.com/intern/diff/D66249968/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141146
Approved by: https://github.com/ezyang
2024-11-21 06:19:12 +00:00
0a4bcbf39c [ONNX] Add support for torch.cond/HOP in onnx exporter (#137428)
This PR implements the framework for supporting HOP in the ONNX exporter. Refer to https://github.com/pytorch/pytorch/issues/140995 for the design.

- Implement support for torch.cond
- Refactor `_add_nodes` into `_translate_fx_graph` to handle nested subgraphs. To support building subgraphs as functions using the same logic, new handlers for `placeholder` and `output` nodes are added to register inputs and outputs on the onnx function.
- Functions are created under the domain of `pkg.torch.__subgraph__`
- Updated the type promotion pass to run on nested subgraphs.
- Implement torch.cond in `_torchlib/ops/hop.py`. Updated the registry to discover these ops.
- Improve opset_import handling robustness with `add_opset_imports` IR pass. To achieve this, we added opset version to all Nodes. Fixes https://github.com/pytorch/pytorch/issues/139503
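
A hedged sketch of the newly supported pattern; the export call assumes the dynamo-based `torch.onnx.export(..., dynamo=True)` path:
```python
import torch

class CondModel(torch.nn.Module):
    def forward(self, x):
        def true_fn(x):
            return x.sin()
        def false_fn(x):
            return x.cos()
        # torch.cond is the higher-order op now translated into ONNX If + functions.
        return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

onnx_program = torch.onnx.export(CondModel(), (torch.randn(3),), dynamo=True)
```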

Fixes #117655 Fixes #123972 Fixes #93743 Closes https://github.com/pytorch/pytorch/issues/140995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137428
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-11-21 03:02:43 +00:00
e0482fdf95 Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-21 01:40:11 +00:00
b44ecd91ba [c10d] Switch all timer logging in c10d to wait_counter (#141154)
Summary: The original decorator-based time logger is bad in performance and capacity, so we want to replace it with the PyTorch `_WaitCounter` now.

Test Plan: Tested on workload and no regression has been seen: https://fburl.com/scuba/aps_instrumentation_components/mskj73ea

Differential Revision: D66218675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141154
Approved by: https://github.com/wz337
2024-11-21 01:10:11 +00:00
225d3f4495 [subclasses] Subclass parameterization to not wrap-unwrap on forward (#140632)
One of the common use cases of tensor Subclasses is to replace all model Parameters with a Subclass that provides an alternative implementation of common ops. E.g. quantization replaces weights with a QuantizedSubclass.

AotAutograd lifts Parameters up to graph arguments and wraps graph execution at runtime with wrapping/unwrapping of those subclasses.

Even though a single unwrap is not critically expensive (~14us), when we have to unwrap/wrap all linear weights, that can add substantially to runtime, potentially more than the compiled region's execution time. E.g. 20 weights * 14us = 0.3ms.

This is a parametrization to unwrap tensor subclasses, as used in torchao: https://github.com/pytorch/ao/blob/main/torchao/utils.py#L294

It adds a parametrization that unwraps tensor subclasses into plain tensors. As a result, the registered parameters change (all registered parameters become plain tensors) and the state_dict is not compatible before/after the transformation.

This transformation is applied before dynamo and makes breaking changes, so we leave it to the user to apply explicitly.
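
Below is a minimal, hypothetical sketch of the idea (class names are illustrative, not the exact utility added by this PR): a parametrization whose `forward()` rebuilds the subclass from plain tensors, so the module's registered parameters stay plain.
```python
import torch
from torch.nn.utils import parametrize
from torch.testing._internal.two_tensor import TwoTensor

class UnwrapTwoTensor(torch.nn.Module):
    def forward(self, a, b):
        # Rebuild the subclass on the fly when the parameter is accessed.
        return TwoTensor(a, b)

    def right_inverse(self, t):
        # Decompose into plain tensors, which become the registered parameters.
        if isinstance(t, TwoTensor):
            return t.a, t.b
        return t, t.clone()

linear = torch.nn.Linear(4, 4)
parametrize.register_parametrization(linear, "weight", UnwrapTwoTensor(), unsafe=True)
print(type(linear.weight))                     # the rebuilt subclass
print([type(p) for p in linear.parameters()])  # plain nn.Parameters underneath (plus bias)
```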

Testing:
```
TORCH_LOGS="graph_code,aot" python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
```
TORCH_LOGS="graph_code,aot,export" python test/dynamo/test_export.py -k test_subclass_parameters
```

```
TRACED GRAPH
  ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
  <eval_with_key>.0 class GraphModule(torch.nn.Module):
     def forward(self, L_self_modules_parametrizations_modules_p1_parameters_original0_: "f32[3, 4]", L_x_: "f32[3, 4]", L_self_modules_parametrizations_modules_p2_parameters_original0_: "f32[3, 4]", L_self_modules_parametrizations_modules_p2_parameters_original1_: "f32[3, 4]"):
         l_self_modules_parametrizations_modules_p1_parameters_original0_ = L_self_modules_parametrizations_modules_p1_parameters_original0_
         l_x_ = L_x_
         l_self_modules_parametrizations_modules_p2_parameters_original0_ = L_self_modules_parametrizations_modules_p2_parameters_original0_
         l_self_modules_parametrizations_modules_p2_parameters_original1_ = L_self_modules_parametrizations_modules_p2_parameters_original1_

          # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/subclasses.py:42 in __tensor_unflatten__, code: return WrapperSubclass(a, outer_size, outer_stride)
         rebuilt: "f32[3, 4]" = torch.testing._internal.subclasses.WrapperSubclass(l_self_modules_parametrizations_modules_p1_parameters_original0_, None, None);  l_self_modules_parametrizations_modules_p1_parameters_original0_ = None

          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
         mul: "f32[3, 4]" = 2 * rebuilt;  rebuilt = None
         add: "f32[3, 4]" = l_x_ + mul;  l_x_ = mul = None

          # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/two_tensor.py:58 in __tensor_unflatten__, code: return TwoTensor(a, b, outer_size, outer_stride)
         rebuilt_1: "f32[3, 4]" = torch.testing._internal.two_tensor.TwoTensor(l_self_modules_parametrizations_modules_p2_parameters_original0_, l_self_modules_parametrizations_modules_p2_parameters_original1_, None, None);  l_self_modules_parametrizations_modules_p2_parameters_original0_ = l_self_modules_parametrizations_modules_p2_parameters_original1_ = None

          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
         add_1: "f32[3, 4]" = add + rebuilt_1;  add = rebuilt_1 = None
         return (add_1,)

TRACED GRAPH
 ===== __compiled_fn_1 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
  def forward(self, L_self_modules_parametrizations_modules_p1_parameters_original0_: "f32[3, 4][4, 1]cpu", L_x_: "f32[3, 4][4, 1]cpu", L_self_modules_parametrizations_modules_p2_parameters_original0_: "f32[3, 4][4, 1]cpu", L_self_modules_parametrizations_modules_p2_parameters_original1_: "f32[3, 4][4, 1]cpu"):
      l_self_modules_parametrizations_modules_p1_parameters_original0_ = L_self_modules_parametrizations_modules_p1_parameters_original0_
      l_x_ = L_x_
      l_self_modules_parametrizations_modules_p2_parameters_original0_ = L_self_modules_parametrizations_modules_p2_parameters_original0_
      l_self_modules_parametrizations_modules_p2_parameters_original1_ = L_self_modules_parametrizations_modules_p2_parameters_original1_

       # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/subclasses.py:42 in __tensor_unflatten__, code: return WrapperSubclass(a, outer_size, outer_stride)
      rebuilt: "f32[3, 4][4, 1]cpu" = torch.testing._internal.subclasses.WrapperSubclass(l_self_modules_parametrizations_modules_p1_parameters_original0_, None, None);  l_self_modules_parametrizations_modules_p1_parameters_original0_ = None

       # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
      mul: "f32[3, 4][4, 1]cpu" = 2 * rebuilt;  rebuilt = None
      add: "f32[3, 4][4, 1]cpu" = l_x_ + mul;  l_x_ = mul = None

       # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/two_tensor.py:58 in __tensor_unflatten__, code: return TwoTensor(a, b, outer_size, outer_stride)
      rebuilt_1: "f32[3, 4][4, 1]cpu" = torch.testing._internal.two_tensor.TwoTensor(l_self_modules_parametrizations_modules_p2_parameters_original0_, l_self_modules_parametrizations_modules_p2_parameters_original1_, None, None);  l_self_modules_parametrizations_modules_p2_parameters_original0_ = l_self_modules_parametrizations_modules_p2_parameters_original1_ = None

       # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
      add_1: "f32[3, 4][4, 1]cpu" = add + rebuilt_1;  add = rebuilt_1 = None
      return (add_1,)

.py:381] [0/0] [__aot_joint_graph] TRACED GRAPH
.py:381] [0/0] [__aot_joint_graph]  ===== Joint graph 0 =====
.py:381] [0/0] [__aot_joint_graph]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class joint_fn(torch.nn.Module):
.py:381] [0/0] [__aot_joint_graph]     def forward(self, primals, tangents):
.py:381] [0/0] [__aot_joint_graph]         primals_1: "f32[3, 4][4, 1]cpu"; primals_2: "f32[3, 4][4, 1]cpu"; primals_3: "f32[3, 4][4, 1]cpu"; primals_4: "f32[3, 4][4, 1]cpu"; tangents_1: "f32[3, 4][4, 1]cpu"; tangents_2: "f32[3, 4][4, 1]cpu";
.py:381] [0/0] [__aot_joint_graph]
.py:381] [0/0] [__aot_joint_graph]         primals_1, primals_2, primals_3, primals_4, tangents_1, tangents_2, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
.py:381] [0/0] [__aot_joint_graph]          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
.py:381] [0/0] [__aot_joint_graph]         mul: "f32[3, 4][4, 1]cpu" = torch.ops.aten.mul.Tensor(primals_1, 2);  primals_1 = None
.py:381] [0/0] [__aot_joint_graph]         add: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(primals_2, mul);  primals_2 = mul = None
.py:381] [0/0] [__aot_joint_graph]         add_1: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_3);  primals_3 = None
.py:381] [0/0] [__aot_joint_graph]         add_2: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_4);  add = primals_4 = None
.py:381] [0/0] [__aot_joint_graph]         return pytree.tree_unflatten([add_1, add_2, None, None, None, None], self._out_spec)
.py:381] [0/0] [__aot_joint_graph]
.py:381] [0/0] [__aot_joint_graph]
graph_code] TRACED GRAPH
graph_code]  ===== tensorify_python_scalars =====
graph_code]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class joint_fn(torch.nn.Module):
graph_code]     def forward(self, primals, tangents):
graph_code]         primals_1: "f32[3, 4]"; primals_2: "f32[3, 4]"; primals_3: "f32[3, 4]"; primals_4: "f32[3, 4]"; tangents_1: "f32[3, 4]"; tangents_2: "f32[3, 4]";
graph_code]
graph_code]         primals_1, primals_2, primals_3, primals_4, tangents_1, tangents_2, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
graph_code]          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
graph_code]         mul: "f32[3, 4]" = torch.ops.aten.mul.Tensor(primals_1, 2);  primals_1 = None
graph_code]         add: "f32[3, 4]" = torch.ops.aten.add.Tensor(primals_2, mul);  primals_2 = mul = None
graph_code]         add_1: "f32[3, 4]" = torch.ops.aten.add.Tensor(add, primals_3);  primals_3 = None
graph_code]         add_2: "f32[3, 4]" = torch.ops.aten.add.Tensor(add, primals_4);  add = primals_4 = None
graph_code]         return pytree.tree_unflatten([add_1, add_2, None, None, None, None], self._out_spec)
graph_code]
graph_code]
.py:463] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch.testing._internal.subclasses.WrapperSubclass'>, base_idx=None, dynamic_dims=set(), requires_grad=True, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[WrapperSubclass(TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4))))], subclass_inp_meta=[PlainTensorMeta(unwrapped_idx=0, memory_format=None), PlainTensorMeta(unwrapped_idx=1, memory_format=None), PlainTensorMeta(unwrapped_idx=2, memory_format=None), PlainTensorMeta(unwrapped_idx=3, memory_format=None)], subclass_fw_graph_out_meta=[SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=True, attrs={'a': SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=True, attrs={'a': PlainTensorMeta(unwrapped_idx=1, memory_format=None), 'b': PlainTensorMeta(unwrapped_idx=2, memory_format=None)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4))), original_subclass_type=None, memory_format=None)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=WrapperSubclass(TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4)))), original_subclass_type=None, memory_format=None)], subclass_tangent_meta=[SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=False, attrs={'a': SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=False, attrs={'a': PlainTensorMeta(unwrapped_idx=1, memory_format=torch.contiguous_format), 'b': PlainTensorMeta(unwrapped_idx=2, memory_format=torch.contiguous_format)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4))), original_subclass_type=None, memory_format=torch.contiguous_format)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=WrapperSubclass(TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4)))), original_subclass_type=None, memory_format=torch.contiguous_format)], is_train=True, traced_tangent_metas=None, num_symints_saved_for_bw=0, grad_enabled_mutation=None, 
deterministic=False, static_input_indices=[0, 2, 3], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=[], num_backward_tokens=0), inner_meta=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None), OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[PlainTensorMeta(unwrapped_idx=0, memory_format=None), PlainTensorMeta(unwrapped_idx=1, memory_format=None), PlainTensorMeta(unwrapped_idx=2, memory_format=None), PlainTensorMeta(unwrapped_idx=3, memory_format=None)], subclass_fw_graph_out_meta=[PlainTensorMeta(unwrapped_idx=0, memory_format=None), PlainTensorMeta(unwrapped_idx=1, memory_format=None)], subclass_tangent_meta=[], is_train=True, traced_tangent_metas=None, num_symints_saved_for_bw=0, grad_enabled_mutation=None, deterministic=None, static_input_indices=[0], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=[], num_backward_tokens=0)
.py:569] [0/0] [__aot_graphs] TRACED GRAPH
.py:569] [0/0] [__aot_graphs]  ===== Forward graph 0 =====
.py:569] [0/0] [__aot_graphs]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
.py:569] [0/0] [__aot_graphs]     def forward(self, primals_1: "f32[3, 4][4, 1]cpu", primals_2: "f32[3, 4][4, 1]cpu", primals_3: "f32[3, 4][4, 1]cpu", primals_4: "f32[3, 4][4, 1]cpu"):
.py:569] [0/0] [__aot_graphs]          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
.py:569] [0/0] [__aot_graphs]         mul: "f32[3, 4][4, 1]cpu" = torch.ops.aten.mul.Tensor(primals_1, 2);  primals_1 = None
.py:569] [0/0] [__aot_graphs]         add: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(primals_2, mul);  primals_2 = mul = None
.py:569] [0/0] [__aot_graphs]         add_1: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_3);  primals_3 = None
.py:569] [0/0] [__aot_graphs]         add_2: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_4);  add = primals_4 = None
.py:569] [0/0] [__aot_graphs]         return (add_1, add_2)
.py:569] [0/0] [__aot_graphs]
.py:569] [0/0] [__aot_graphs]
.py:580] [0/0] [__aot_graphs] TRACED GRAPH
.py:580] [0/0] [__aot_graphs]  ===== Backward graph 0 =====
.py:580] [0/0] [__aot_graphs]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
.py:580] [0/0] [__aot_graphs]     def forward(self, tangents_1: "f32[3, 4][4, 1]cpu", tangents_2: "f32[3, 4][4, 1]cpu"):
.py:580] [0/0] [__aot_graphs]         return (None, None, None, None)
.py:580] [0/0] [__aot_graphs]
.py:580] [0/0] [__aot_graphs]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140632
Approved by: https://github.com/bdhirsh
2024-11-21 01:09:33 +00:00
5c45984cce skip complex logaddexp tests in 3.12+ (#140731)
This test is failing locally in 3.12 and 3.13 and is blocking 3.13 CI enablement.

It may have to do with the scipy version; see .ci/docker/requirements-ci.txt (3.12+ has scipy 1.12.0/1.14.1, whereas < 3.12 requires scipy 1.10.1).

Wanted to xfail these tests, but they somehow pass sometimes on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140731
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-11-21 01:08:07 +00:00
6882b398a4 [Doc] Remove mention of Intel Macs (#141182)
As we are no longer supporting those.
Also mention that MPS support needs Ventura+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141182
Approved by: https://github.com/clee2000, https://github.com/atalman
2024-11-21 01:05:12 +00:00
2d52f7946b [BE] Use torch.log1p(x) instead of torch.log(1+x) (#141167)
To fix TOR107 linter violations.
Found while trying to migrate PyTorch to the latest torchfix.
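
For reference, the numerical motivation behind the rule:
```python
import torch

x = torch.tensor(1e-8, dtype=torch.float32)
print(torch.log(1 + x))   # tensor(0.) -- 1 + x rounds to 1.0 in float32
print(torch.log1p(x))     # tensor(1.0000e-08)
```
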
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141167
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-11-21 00:36:20 +00:00
cd942d00dd [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-11-21 00:25:20 +00:00
c7d072db99 [AOTAutogradCache] Allowlist various ops found from models to safe list (#140825)
From running internal models, I found a bunch of AOTAutogradCache ops that seem safe to cache.

Would appreciate any suggestions for how to allowlist these in a more general way, but starting with these for now.

Differential Revision: [D66010326](https://our.internmc.facebook.com/intern/diff/D66010326/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66010326/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140825
Approved by: https://github.com/bdhirsh
2024-11-21 00:04:17 +00:00
1d6ca50c5b config: Throw if justknobs value is not a boolean (#139488)
This helps avoid an issue where someone uses a mutable type that justknobs does not support within the code, and then it gets overridden to a different type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139488
Approved by: https://github.com/ezyang
2024-11-20 23:52:17 +00:00
040af3053a [AOTI] Fix a two-pass kernel mismatch (#141041)
Summary: Fixes https://github.com/pytorch/pytorch/issues/140766. In AOTI's two-pass codegen, the first pass generates triton_per_fused_add_native_layer_norm_4, and the second pass generates triton_red_fused_add_native_layer_norm_4. While this problem will go away with the incoming one-pass implementation, further debugging reveals there is a mismatch in has_non_contiguous_pw_in_reduction_kernel between the two passes, due to a symbol comparison problem in stride1_for_last_dim.

Differential Revision: [D66203298](https://our.internmc.facebook.com/intern/diff/D66203298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141041
Approved by: https://github.com/shunting314
2024-11-20 23:34:24 +00:00
ed9135a732 add jk for unspecialize float killswitch (#141143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141143
Approved by: https://github.com/c00w
2024-11-20 23:20:52 +00:00
765a347d21 [BE]: Update CUDNN for Linux to 9.5.1.17 for 12.6 only (#137978)
* Significantly faster, better CUDNN Attention especially on Hopper (FA3 implementation?)
* Lots of bugfixes
* Better performance
* More numerically stable / fixed heuristics
* More functionality for SDPA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137978
Approved by: https://github.com/eqy, https://github.com/drisspg, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2024-11-20 23:11:39 +00:00
93efddc67a Use pip corresponding to python executable (#141165)
Sometimes `python3` and `pip` are aliased to different runtimes, so it's better to always use `pip3`; but since the linter should install packages into the same Python environment, it's even better to just call sys.executable with `-mpip install XYZ` arguments.
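
Sketch of the pattern (the package name here is just an example):
```python
import subprocess
import sys

# Install into the interpreter that is actually running, not whatever "pip" resolves to.
subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "ruff"])
```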

Fixes regression introduced by https://github.com/pytorch/pytorch/pull/124033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141165
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2024-11-20 22:58:33 +00:00
a82bab6419 Run only listed tests on s390x (#140265)
Skip tests that are failing

This was previously part of https://github.com/pytorch/pytorch/pull/125401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140265
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-20 22:53:09 +00:00
701e06b643 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit aefcdb3c9fa787f9d43864f6f99a3590c914324a.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it fails inductor/test_padding in trunk. This is a target determination miss and that failed test was not run in your PR ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2489641453))
2024-11-20 22:13:57 +00:00
abaab5da05 Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835)"
This reverts commit 4c9e77d71e3f4ff9bec6fb5de98789f041f70a61.

Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/izaitsevfb due to breaking typechecks in meta code ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2489638528))
2024-11-20 22:11:19 +00:00
5c37b20d13 Fix autocast HOP pass for nested autocast (#141065)
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_autocast"
```

Differential Revision: D65970066

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141065
Approved by: https://github.com/angelayi
2024-11-20 21:57:11 +00:00
87f9c1abe5 Change export IR to non-functional pre-dispatch IR (#139511)
Differential Revision: [D65362160](https://our.internmc.facebook.com/intern/diff/D65362160)

State after this IR:
1. The tests that require inference IR are replaced with ep.run_decomp({}), so export_for_training_run_decomp is sort of redundant, but I guess it is still nice that multiple rounds of retracing still work. In general, we need some auditing to reduce our redundant testing coverage.
2. After this PR has landed and not been reverted for a week or so, I will replace the export_for_training calls with export, as they are the same thing now.
3. Added more tests to also cover the now-"deprecated" old IR by patching export to use the old export. Reviewers, please look at the internal version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139511
Approved by: https://github.com/ydwu4, https://github.com/angelayi, https://github.com/avikchaudhuri
2024-11-20 21:47:55 +00:00
f3f7ba5a69 Restart dynamo analysis when we fail to tensorify away all symfloat inputs (#140346)
Fixes a bunch of benchmarks that failed with cudagraph errors including `tlp python benchmarks/dynamo/timm_models.py --device cuda --inductor --accuracy --amp --training --only resmlp_12_224` when `specialize_float=False`

Also brings down number of overall failures (with keep-going) from 108 => 62. I'd estimate >80% of those 62 are wobbly expect tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140346
Approved by: https://github.com/ezyang
ghstack dependencies: #140983, #141003
2024-11-20 21:20:41 +00:00
4b3ce62946 [while_loop] support pytree inputs (#140059)
Previously, we only supported carries that are a tuple of tensors. This PR enables us to support a pytree of tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140059
Approved by: https://github.com/zou3519
2024-11-20 21:12:29 +00:00
2ee2dcb736 [Device] Add mps as device type in torch._utils._get_available_device_type() (#141098)
As the title states

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141098
Approved by: https://github.com/malfet
2024-11-20 20:45:59 +00:00
2e3c0c489d Continuous job for pulling artifacts and doing upload (#140453)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140453
Approved by: https://github.com/huydhn
2024-11-20 20:41:52 +00:00
d5ee1d1b58 Remove capture_pre_autograd_graph in test_aot_inductor (#141064)
Summary: as title

Test Plan: CI

Differential Revision: D66191296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141064
Approved by: https://github.com/zhxchen17
2024-11-20 20:34:46 +00:00
aefcdb3c9f Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-20 20:26:49 +00:00
161425ff9f Added aten.bernoulli.p and aten.bernoulli.default decompositions (#139141)
Fixes #105519

Added the aten.bernoulli.p decomposition and moved/rewrote aten.bernoulli.default so that both are included in the core ATen decompositions.

Tested with the sample code in [105519](https://github.com/pytorch/pytorch/issues/105519); torch.bernoulli is now decomposed for that snippet.
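
For illustration, a minimal sketch of how such a decomposition typically looks (an assumption for clarity, not necessarily the exact decomp registered by this PR):

```python
import torch

def bernoulli_p_decomp(x: torch.Tensor, p: float) -> torch.Tensor:
    # Draw uniform noise with the input's shape, threshold at p, and cast the
    # resulting bool mask back to the input dtype.
    return (torch.rand_like(x, dtype=torch.float32) < p).to(x.dtype)

sample = bernoulli_p_decomp(torch.empty(4, 4), 0.3)
```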

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139141
Approved by: https://github.com/eellison
2024-11-20 19:52:57 +00:00
bc69a19139 [MPS] Add support for bf16 autocast (#139390)
This PR adds support for bf16 autocast. Most of the code and ideas are copied from #99272.

Most of the heavy lifting was done by AI.
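
A usage sketch of what this enables (assuming an MPS-capable build; the shapes are arbitrary):

```python
import torch

if torch.backends.mps.is_available():
    x = torch.randn(8, 8, device="mps")
    w = torch.randn(8, 8, device="mps")
    # With this change, bfloat16 is accepted as an autocast dtype on MPS.
    with torch.autocast(device_type="mps", dtype=torch.bfloat16):
        y = x @ w
    print(y.dtype)  # expected: torch.bfloat16
```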

Fixes #139386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139390
Approved by: https://github.com/malfet

Co-authored-by: Kulin Seth <kulin_seth@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-20 19:52:28 +00:00
808f0f656d [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-20 19:50:46 +00:00
a8794fd7df [MPS] Fix conv backward pass for channels last (#141009)
This looks like a regression caused by use of the strided API, but adding the test revealed (at least in CI) that on Ventura it "worked" while returning garbage results. Fixed by removing all the channels-last logic, as it is irrelevant for the strided-API case and the placeholder already turns the tensor into the correct one.

This also allows one to remove `mem_format_key` and `ns_shape_key` (it was redundant even back then, as `mem_format_key` + `getTensorsStringKey(grad_output_t)` already uniquely identified the operation)

Fixes https://github.com/pytorch/pytorch/issues/140902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141009
Approved by: https://github.com/manuelcandales
2024-11-20 19:50:31 +00:00
c9db2c6328 [ROCm] cudagraph explicit sync only after capture_begin() (#138722)
Since ROCm 6.2, hipGraphExecDestroy doesn't free memory immediately; it waits for the next sync point to free the memory, which ensures that all hipGraphLaunch calls have finished before any memory is released.
We need to ensure all async operations finish before deleting the object.

The capture_dev_ variable saves the device number when the capture_begin() method is called.
However, a CUDAGraph can be created and destroyed without capture_begin() ever being called; `capture_dev_ = UNDEFINED_DEVICE;` lets us detect that case and skip the sync.

Tests impacted:
test_cuda.py::TestCuda::test_graph_make_graphed_callables_*
distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_allreduce_in_cudagraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138722
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/jeffdaily
2024-11-20 19:37:22 +00:00
caa3a3e12c Only compute new_untracked_symbols and new_unbacked_bindings if needed. (#140083)
Summary:
237s -> 198s.
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=2000

Test Plan: NA

Differential Revision: D65638637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140083
Approved by: https://github.com/ezyang, https://github.com/isuruf, https://github.com/anijain2305
2024-11-20 19:28:18 +00:00
4ffce45100 AOTInductor: properly generate cpp_wrapper runtime assertions (#141050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141050
Approved by: https://github.com/desertfire
ghstack dependencies: #141058
2024-11-20 19:17:47 +00:00
5c684503a6 cpp_wrapper: Fix dtype_view wrapping reinterpret_view (#141058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141058
Approved by: https://github.com/desertfire
2024-11-20 19:17:47 +00:00
d3902b5e20 [dynamo][CI] Add numpy-2.X shard (follow up) (#140586)
Fixes #107302

This is a clone and fix for #139199.

This PR is a small step for the overall NumPy 2 support.
It adds a new CI job for testing with NumPy 2 with one test file only.
More tests to be fixed and added later in follow-up pull requests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140586
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2024-11-20 19:11:28 +00:00
b5db3cb61c Skip uploading benchmark records when there is no model name (#141145)
A small fix I just realize after https://github.com/pytorch/pytorch/pull/141087.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141145
Approved by: https://github.com/malfet
2024-11-20 19:05:47 +00:00
1a7055cb73 Record PR time benchmark results in JSON format (#140493)
I'm trying to make this benchmark results available on OSS benchmark database, so that people can query it from outside.  The first step is to also record the results in the JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 18:54:01 +00:00
4acd56eb53 Upload MPS benchmark results (#141087)
This uploads the MPS benchmark results to benchmark database.  The data can then be queried, for example:

```
select benchmark, model, metric from oss_ci_benchmark_v3 where head_sha = '99a133116fee15aa1467165f2b209b37da53f189' and metric.name in ['eager_peak_mem', 'dynamo_peak_mem', 'speedup'] and model.name = 'BERT_pytorch'
```

I'm documenting the JSON format at https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database

### Testing

Locally,

```
PYTHONPATH=/Users/huydo/Storage/mine/benchmark python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output test/test-reports/torchbench_training.csv
```

Workflow dispatch https://github.com/pytorch/pytorch/actions/runs/11927990520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141087
Approved by: https://github.com/malfet
2024-11-20 18:18:21 +00:00
1d8318df98 [BE][Ez]: Reserve vector for NT GEMM Matmul (#141130)
Easy fix to missing reserve calls in NT Matmul CUDA kernel to improve perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141130
Approved by: https://github.com/malfet
2024-11-20 18:12:51 +00:00
9d229f08f4 [dynamo][guards] Introduce a diff_guard_manager (#140250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140250
Approved by: https://github.com/jansel
ghstack dependencies: #140223
2024-11-20 17:59:30 +00:00
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope.
* List comprehensions not only often have better typing, but also carry 50+% less overhead than for loops. They also preserve length information, etc., and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
8d708090c0 Optimize increment summations [Latest Nov 15] (#140822)
Summary:
**wins**
On the torchrec benchmark, for 2K nodes this saves 40 seconds;
with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 seconds (with the max opt on).
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```
This diff optimizes construction of expressions of the form
a+b+c... (all unique symbols),
which are very common in torchrec models.

**How**
Expressions of the form a+b+c need no optimization from Add; the only work needed is sorting the terms.
If we have a+b+c and we are adding (d) to it, we can binary-search for the position of (d)
and avoid re-optimizing the new expression by passing along the new order.
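
A rough Python sketch of the idea (illustrative only, not the sympy code touched by this diff): when the existing sum is already sorted, inserting one new unique term only needs a binary search instead of re-sorting the whole expression.

```python
import bisect

def add_term(sorted_terms: list, new_term: str) -> list:
    # sorted_terms represents an already-sorted a + b + c ...; find where
    # new_term belongs and splice it in, keeping the result sorted without
    # reprocessing every existing term.
    pos = bisect.bisect_left(sorted_terms, new_term)
    return sorted_terms[:pos] + [new_term] + sorted_terms[pos:]

print(add_term(["a", "b", "c"], "bb"))  # ['a', 'b', 'bb', 'c']
```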

**Extensions**:
1. support constant terms.
2. support 10a+10b+... (this will give even more wins; support will be extended in a second PR)

Differential Revision: D66008482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140822
Approved by: https://github.com/ezyang
2024-11-20 16:48:20 +00:00
a440a01832 [MPS][BE] Let preprocessor do preprocessing (#141103)
Instead of calling the `REGISTER_FUSED_ADAM_OP` macro with 7 parameters 16 times, define 4 type-parameter macros for each op and then one macro to define the quartet of ops: Adam, AdamW, and their grad functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141103
Approved by: https://github.com/kulinseth
ghstack dependencies: #141089, #141090, #141092
2024-11-20 14:03:17 +00:00
b0deddde46 [MPS][BE] Move FusedOptimizerOps to its own shader (#141092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141092
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #141089, #141090
2024-11-20 14:03:17 +00:00
446ea2aea5 pow: fix meta function output argument dtype check. (#140287)
Tracking issue: #138399

This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:

```python
def run(device):
    S = (5,)
    a = torch.rand(S, device=device, dtype=torch.float32)
    b = 2
    out = torch.empty(S, device=device, dtype=torch.float64)
    return torch.pow(a, b, out=out)

>>> run("cpu")
Traceback (most recent call last):
  File "test.py", line 34, in run
    return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float

>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```

**~Update:~**

~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
further 2 modifications:~

- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~

**Update:**

Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:

- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
2024-11-20 13:28:47 +00:00
a9e54f64ee Remove unused Python API named _set_torch_function_mode (#141023)
Detailed description:

As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141023
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-20 09:48:03 +00:00
ffb305d3a6 Fix bugs about torch.fx.experimental.proxy_tensor.make_fx (#141022)
Detailed description:

The code below raises an error:
```Python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def func(a):
    b = a + 1
    c = b.view(-1)
    c.add_(1)
    return b

input = torch.randn(2)
out = make_fx(func)(input)
```

The error info looks like this:
```Python
...
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/codegen.py", line 34, in <module>
    from .variables.torch_function import TensorWithTFOverrideVariable
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 185, in <module>
    populate_builtin_to_tensor_fn_map()
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 146, in populate_builtin_to_tensor_fn_map
    inp0 = torch.ones(1)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1240, in __torch_function__
    return func(*args, **kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/utils/_stats.py", line 21, in wrapper
    return fn(*args, **kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1342, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 907, in proxy_call
    name=proxy_mode.tracer.graph._target_to_str(func.overloadpacket.__name__),
AttributeError: 'PythonKeyTracer' object has no attribute 'graph'
...
```

Solution:
Import torch._dynamo before dispatch_trace is called, so that the context set up before dispatch_trace does not affect the torch._dynamo import.
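
A hedged sketch of the same idea at user level (the actual fix lives inside proxy_tensor; this just shows the import ordering that avoids the failure):

```python
import torch
import torch._dynamo  # imported up front, before any tracing context is installed
from torch.fx.experimental.proxy_tensor import make_fx

def func(a):
    b = a + 1
    c = b.view(-1)
    c.add_(1)
    return b

out = make_fx(func)(torch.randn(2))
```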

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141022
Approved by: https://github.com/ezyang
2024-11-20 09:42:32 +00:00
c9c8370feb Openreg: Add RNG Generator (#138449)
Implement RNG Generator by falling back to CPUGeneratorImpl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138449
Approved by: https://github.com/ezyang
2024-11-20 09:27:55 +00:00
54f380f64a Also check for attention mask shape for _sfdp_params_check (#141003)
Fixes `python test/inductor/test_fused_attention.py SDPAPatternRewriterCpuTests.test_pattern_fails_with_unsupported_mask_cpu` when `specialize_float=False`. You might wonder how this is related: there is a "negative" test that expects us not to match the pattern. Previously it failed the isinstance(param, Tensor) check, but now that we tensorify the float, the pattern did match and caused a failure. Checking that the attention mask has the same shape ensures this negative test case still fails to match, as intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141003
Approved by: https://github.com/ezyang
ghstack dependencies: #140983
2024-11-20 08:37:28 +00:00
d869344bc0 [export] serialize sympy.Exprs as ASTs instead of strings (#140084)
Summary: The way we've been de/serializing sympy.Exprs is not roundtrippable in all cases (serialize by calling `str(expr)`, and deserialize by calling `sympy.sympify(expr_str)`). This has led to expressions being mathematically equivalent but structurally different, causing issues in ValueRanges. Example issue: https://github.com/pytorch/pytorch/issues/136797

This starts to deprecate the use of `expr_str` and stores expressions in AST format instead. For BC purposes, `expr_str` deserialization is still supported, but we will always serialize to `expr_ast`. We'll kill this once the serialization upgrader design is finalized and implemented.
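
For illustration, a small case (not necessarily one of the failing expressions from the issue) where stringify-then-sympify is mathematically equivalent but structurally different:

```python
import sympy

x = sympy.Symbol("x")
expr = sympy.Add(x, x, evaluate=False)   # structurally Add(x, x)
roundtripped = sympy.sympify(str(expr))  # str(expr) == "x + x"; sympify folds it to 2*x

print(expr, "vs", roundtripped)   # x + x vs 2*x
print(expr == roundtripped)       # False: structural equality differs despite mathematical equivalence
```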

Test Plan: test_export

Differential Revision: D65638757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140084
Approved by: https://github.com/angelayi
2024-11-20 07:44:25 +00:00
7e9e83a8c6 [inductor] force contiguous layout for implicit fallback (#140996)
Fix https://github.com/pytorch/pytorch/issues/140462 .

Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding, etc.). The original issue was found in the backward op of weight_norm. The fix in this PR is a general one: we force inputs to all implicit fallback kernels to be contiguous.

I had to refactor the code a bit to make this work. Previously we applied layout constraints in `GraphLowering.run_node` and looked for implicit fallbacks in `call_function`. The problem is that when we set up an implicit fallback in `call_function` with a layout constraint, we never get a chance to apply that constraint. The refactor moves the code that applies layout constraints into `call_function`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel
2024-11-20 06:41:17 +00:00
8f3c71ad27 Add torch.sum dtype promotion description (#140939)
Fixes #82159

Add a note describing the type promotion behavior of `torch.sum`.
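
A quick example of the promotion behavior the note describes:

```python
import torch

x = torch.ones(4, dtype=torch.int8)
print(x.sum().dtype)                  # torch.int64 -- bool/integer inputs are promoted to int64
print(x.sum(dtype=torch.int8).dtype)  # torch.int8  -- an explicit dtype overrides the promotion

y = torch.ones(4, dtype=torch.float16)
print(y.sum().dtype)                  # torch.float16 -- floating-point dtypes are kept as-is
```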

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/fb952676-f190-4680-9e15-ea8c99d33c67)

**After**
![image](https://github.com/user-attachments/assets/ee0d46a6-5053-46d5-b412-5c919a40965a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140939
Approved by: https://github.com/zou3519
2024-11-20 06:20:01 +00:00
93e3c91679 [inductor] support linear+binary foldinig for freezing path (#138807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138807
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-11-20 05:34:09 +00:00
a864c42781 [dynamo][guards] Support cloning of Guard Manager (#140223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140223
Approved by: https://github.com/jansel
2024-11-20 05:28:45 +00:00
4c9e77d71e Add back DistributedDataParallel types that were lost when pyi was removed (#136835)
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
2024-11-20 04:57:19 +00:00
5ab1c51f0f [Easy] Use nested namespaces in aten (#141012)
Change files with nested namespaces in aten
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141012
Approved by: https://github.com/Skylion007
2024-11-20 04:05:23 +00:00
cyy
d91484509a [1/N] Apply bugprone-unchecked-optional-access (#140679)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140679
Approved by: https://github.com/ezyang
2024-11-20 04:04:41 +00:00
a4e8ca789a Revert "Record PR time benchmark results in JSON format (#140493)"
This reverts commit 783cd9c8dd8a57d58ac0260ce18253e0cc6a69b7.

Reverted https://github.com/pytorch/pytorch/pull/140493 on behalf of https://github.com/huydhn due to I think I missed something in the workflow setup as the test is failing in non-test CI jobs ([comment](https://github.com/pytorch/pytorch/pull/140493#issuecomment-2487360455))
2024-11-20 04:04:07 +00:00
84d86e3767 [numeric_debugger] guard the input generate_numeric_debug_handle as GraphModule type (#140742)
Summary: Support ExportProgram type in generate_numeric_debug_handle, to better meet the requirement

Test Plan: ci

Differential Revision: D65920529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140742
Approved by: https://github.com/tarun292, https://github.com/jerryzh168
2024-11-20 03:40:04 +00:00
c05813d2a9 [AOTI Minifier] Exclude illegal graphs from minifier search (#140999)
Summary:
Some graphs produced by the minifier graph cutter cannot be used for AOTI/export (illegal graphs). These should be treated as graphs that don't fail in the minifier, so the minifier keeps searching.

One example is the following graph, where `true_graph_0` is an fx.GraphModule. Here, export.export() would give a `UserError` with `ErrorType = UserErrorType.INVALID_OUTPUT`.

```
      # graph():
        #     %true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
        #     return (true_graph_0,)
```

This graph could be obtained from the module below:

```python
    class M(torch.nn.Module):
        def forward(self, x, flag):
            flag = flag.item()

            def true_fn(x):
                return x.clone()

            return torch.cond(flag > 0, true_fn, true_fn, [x])
 ```

So we detect such errors and exclude them from the minifier's search (these graphs are treated as not having failed).

This is OK and won't miss any actual errors, since the AOTI minifier is only designed to catch errors in the AOTI phase; it is not responsible for catching export bugs.

Test Plan:
```
buck2 run  fbcode//caffe2/test/inductor:test_minifier_utils  -- -r invalid_output
```

Differential Revision: D66143487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140999
Approved by: https://github.com/henrylhtsang
2024-11-20 03:20:06 +00:00
f0f9393779 add serialized_type_name to torch.size register_pytree_node (#141047)
Summary: We are working on onboarding legokit modules to ModuleStability and this is needed to fix the serialization issue found in P1680200613

Test Plan:
`buck2 test //torchrec/fb/legokit/module_stability_tests/layer_norm_stability_test:layer_norm_stability_test -- --env ADD_NEW_STABILITY_CONFIGS=True`

serialization succeeds when the above command is run on top of this diff.

Differential Revision: D66034492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141047
Approved by: https://github.com/angelayi
2024-11-20 03:14:10 +00:00
fc905d92c5 [MPS][BE] Do not create 4 instances of FUSED_ADAM_OPS (#141090)
Defining `static char shaderSource[]` in the header instantiates it in every translation unit that includes it.
Solved the problem by renaming `static auto getCPLState(const std::string&)` to `auto getFusedAdamCPLState(const std::string&)` and instantiating it only once, which resulted in a 500K reduction in binary size (and perhaps even more in runtime footprint).

I.e. before
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x  1 malfet  staff  183357744 Nov 19 17:58 lib/libtorch_cpu.dylib
```
and after
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x  1 malfet  staff  183357120 Nov 19 17:57 lib/libtorch_cpu.dylib
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141090
Approved by: https://github.com/Skylion007
ghstack dependencies: #141089
2024-11-20 03:04:33 +00:00
a8a428df3b [MPS][BE] Use nested namespace (#141089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141089
Approved by: https://github.com/Skylion007
2024-11-20 03:04:33 +00:00
da115eff86 [dynamic] Reduce stack trace logs in symbolic_shape (#141068)
Motivation: https://github.com/pytorch/pytorch/issues/139408

To reduce excessive warning logs. You can get back previous behavior by prepending `TORCH_LOGS="dynamic" `

repro: https://github.com/pytorch/pytorch/issues/139408

after:
```
/torch/fx/experimental/symbolic_shapes.py:6452] runtime_asserts_frozen but then got 3*TruncToInt(IntTrueDiv(s0, 1))*TruncToInt(IntTrueDiv(s1, 1)) < 2147483648
/torch/fx/experimental/symbolic_shapes.py:6032] Ignored guard 3*TruncToInt(IntTrueDiv(s0, 1))*TruncToInt(IntTrueDiv(s1, 1)) < 2147483648 == True, this could result in accuracy problems
/torch/fx/experimental/symbolic_shapes.py:6452] runtime_asserts_frozen but then got 2*s0*s1 + s1*(s0 - 1) + s1 < 2147483648
/torch/fx/experimental/symbolic_shapes.py:6032] Ignored guard 2*s0*s1 + s1*(s0 - 1) + s1 < 2147483648 == True, this could result in accuracy problems
```

before: 174 lines

Differential Revision: D66196982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141068
Approved by: https://github.com/ezyang
2024-11-20 03:00:53 +00:00
32094626f2 [fr] fix OSS broken flight recorder (#140973)
Summary:
OSS flight recorder does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader.

Fixes item #2 in reported issue:
https://github.com/pytorch/pytorch/issues/140879

Test Plan:
BEFORE:
```
❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main
    details, version = read_dir(args)
  File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir
    assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}"
AttributeError: 'Namespace' object has no attribute 'folder'
```

AFTER:
```
python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main
    db = build_db(details, args, version)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db
    check_no_missing_dump_files(entries, memberships)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files
    dumps_ranks == all_ranks
AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119}
❯ git status
fatal: not a git repository (or any parent up to mount point /data/users/cpio)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main
    db = build_db(details, args, version)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db
    check_no_missing_dump_files(entries, memberships)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files
    dumps_ranks == all_ranks
AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119}
```

Differential Revision: D66117013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140973
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-11-20 02:58:11 +00:00
241d2259d3 torch/config: fix mock behaviour (#140779)
Mock only replaces the value that was removed if, after deletion, it
does not see the attribute.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140779
Approved by: https://github.com/ezyang
2024-11-20 02:57:16 +00:00
878a849c92 [aoti] Remove example inputs from aoti_compile_and_package (#140991)
Differential Revision: [D66136724](https://our.internmc.facebook.com/intern/diff/D66136724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140991
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #140990
2024-11-20 02:49:47 +00:00
cb6a21b033 [export] Add setattr for ep.example_inputs (#140990)
Differential Revision: [D66136725](https://our.internmc.facebook.com/intern/diff/D66136725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140990
Approved by: https://github.com/yushangdi, https://github.com/ydwu4
2024-11-20 02:49:20 +00:00
ff17d2b83e [easy][logging] Remove dynamo_timed fwd_only param (#140993)
Summary: It's ignored; remove it

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140993
Approved by: https://github.com/ezyang
2024-11-20 02:31:51 +00:00
5e0c009a5a Forward fix lint after #140443 (#141088)
TSIA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141088
Approved by: https://github.com/atalman
2024-11-20 02:21:24 +00:00
f23d034826 [PyTorch Decomp] Allow native_layernorm decomp to align [mean, rstd] output dtypes with input dtype for MTIA backend (#141025)
Summary: As title

Test Plan: CI

Differential Revision: D66169328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141025
Approved by: https://github.com/bdhirsh
2024-11-20 01:58:08 +00:00
783cd9c8dd Record PR time benchmark results in JSON format (#140493)
I'm trying to make this benchmark results available on OSS benchmark database, so that people can query it from outside.  The first step is to also record the results in the JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 01:48:00 +00:00
eff22171d2 Add Current Mask Var To CSE Cache Key (#140838)
This torch.cat kernel has multiple subblocks which load from the same input. We were incorrectly reusing the mask vars from the first load for the second load.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140838
Approved by: https://github.com/jansel
ghstack dependencies: #140841
2024-11-20 00:55:56 +00:00
b740a1b96c [user triton] Ignore backend-specific args in the TTIR analysis (#141062)
Fixes #140800.

On AMD, backend-specific args like `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` are passed either directly to the kernel or via `triton.Config`, whereas they don't exist as kernel parameters. Native Triton code handles those extra args [here](a6bb57d628/python/triton/runtime/jit.py (L594-L596)). In this PR, we add similar handling to the TTIR analysis code to avoid bailing out.
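
Conceptually, the handling amounts to ignoring kwargs that are not real kernel parameters; a rough sketch with hypothetical names (not the actual analysis code):

```python
def drop_backend_only_kwargs(launch_kwargs: dict, kernel_param_names: set) -> dict:
    # Backend-specific options such as matrix_instr_nonkdim / waves_per_eu / kpack are
    # not kernel parameters, so they are filtered out before walking the TTIR args.
    return {k: v for k, v in launch_kwargs.items() if k in kernel_param_names}
```
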
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141062
Approved by: https://github.com/oulgen
2024-11-20 00:37:34 +00:00
7c7c34693d disable tensorify floats when cuda graphs is on (#140983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140983
Approved by: https://github.com/ezyang
2024-11-20 00:33:09 +00:00
cyy
0fca51bcc4 [11/N] Fix Wextra-semi warning (#140926)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140926
Approved by: https://github.com/ezyang
2024-11-20 00:32:45 +00:00
0443398f5b Implement deterministic scan (#140887)
Fixes #89492
Uses block-wise cub primitives
On large inputs, this implementation is approximately 25% slower than the device cub implementation, so it's turned on only in cases where cub would have been used (floating-point inputs, cumsum that is effectively 1d).
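
A hedged usage sketch of the scenario this targets (assuming the deterministic path is exercised via deterministic-algorithms mode; the exact gating may differ from the actual dispatch logic):

```python
import torch

if torch.cuda.is_available():
    torch.use_deterministic_algorithms(True)
    x = torch.randn(1_000_000, device="cuda")   # floating-point, effectively-1d input
    out = torch.cumsum(x, dim=0)                # scan that previously went through device cub
```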

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140887
Approved by: https://github.com/ezyang, https://github.com/kurtamohler
2024-11-19 23:43:26 +00:00
6ccd35ccb8 cpp_wrapper: Fix searchsorted fallback ops (#140817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140817
Approved by: https://github.com/desertfire
ghstack dependencies: #140624, #140634
2024-11-19 23:34:20 +00:00
ce15d1ebc2 Narrow the scope of several cpp_wrapper test skips (#140634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140634
Approved by: https://github.com/desertfire
ghstack dependencies: #140624
2024-11-19 23:34:20 +00:00
34b2165bdb Insert aten.add into fallback_ops, and fix Tensor -> Scalar conversion in ir.FallbackKernel (#140624)
The code in ir.FallbackKernel will long-term be obviated by the solution for #90923.

Closes #131334.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140624
Approved by: https://github.com/desertfire
2024-11-19 23:34:20 +00:00
9bc9d4cdb4 Fix MPS synchronize by waiting for root buffer to complete (#140725)
Makes https://github.com/pytorch/pytorch/issues/139550#issuecomment-2468860559 work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140725
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-11-19 23:10:24 +00:00
780c580d68 General per-SampleInput xfail / skip system (#140443)
### Background
This PR adds the functionality to xfail / skip on a per-`SampleInput` basis for `OpInfo` tests. See #89354 and #82669 for some requests asking for this type of functionality.

This was originally landed for NJT in #138370 and is generalized and slightly tweaked here.

### Design
#### Principles
* Clean separation among `SampleInput` generation logic, test logic that uses the `SampleInput`s, and xfail / skip logic (which will change as bugs are addressed).
* Flexibility in xfail / skip predicate specification - ideally each bug can be handled by a single skip / xfail, even if it surfaces across a specific class of ops.
    * This is important in practice for NJT, where it's common to have a bug that affects all binary ops, for example.
* Opt-in with minimal test logic changes + no substantial impact on other tests.

#### Details
The core new concept is a `SampleRule`, which can be either an `XFailRule` or `SkipRule`.

```python
@dataclass
class SampleRule(ABC):
    # function to indicate whether the rule applies to this op; return True if so
    # NB: str arg of callable is device_type
    op_match_fn: Callable[[str, OpInfo], bool] = None
    # function to indicate whether the rule applies to this sample; return True if so
    sample_match_fn: Callable[[torch.device, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

@dataclass
class XFailRule(SampleRule):
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"

@dataclass
class SkipRule(SampleRule):
    ...
```

* See below for example usage details, but at a high level: each test should have a corresponding list of `sample_skips_and_xfails`.
    * The list of `sample_skips_and_xfails` is traversed in order, and the first rule that matches (if any) is applied, so order can matter.
    * The PR includes a logging mechanism for matched rules accessible by setting the loglevel to `DEBUG`.
* The split between `op_match_fn` and `sample_match_fn` is made to allow pre-filtering of the list of rules to get only those that apply to the op under test.
* Each `SampleInput` is run within a subtest context so they can be individually skipped / xfailed as needed. This also means that a test will no longer stop after the first erroring `SampleInput`; all samples will be run through test logic.

### Example Usage
Consider the following OpInfo test:
```python
class MyTestCase(TestCase):
    @ops(op_db)
    def test_foo(self, device, dtype, op):
        for sample in op.sample_inputs(device, dtype, requires_grad=False):
            # do some SampleInput-based test logic
            output = op.op(sample.input, *sample.args, **sample.kwargs)
            ...
```

This is a common pattern for such tests; simply generate a list of `SampleInputs` and run them through the op. Now say you want to xfail one of these `SampleInput`s for a given op. Today, you have to xfail the entire test or hack around this in the test logic.

This PR lets you do this to get very flexible xfail / skips based on op / sample input properties:
```python
# NB: Define rules for per-SampleInput xfails / skips. These can also be defined in-line in the @ops decorator, but
# it can be more readable to maintain these somewhere else. These are attempted to be matched in order and
# the first one that matches applies, so order can matter.
FOO_SKIPS_AND_XFAILS = [
    XFailRule(
        error_type=ValueError,
        error_msg="2D inputs not supported",
        op_match_fn=lambda device, op: (
            # NB: logic for which ops this rule applies to goes here
            op.full_name == "add"
        ),
        sample_match_fn=lambda device, sample: (
            # NB: logic which samples this rule applies to goes here
            sample.input.dim() == 2
        ),
        # NB: optional rule identifier can help with debugging matched rules
        name="add_with_2D_inputs_not_supported",
    ),
    # NB: This follows a similar structure as XFailRule but without error_type / error_msg. Obviously
    # this skips a particular SampleInput instead of xfailing :)
    SkipRule(...),
    ...
]

class MyTestCase(TestCase):
    @ops(op_db)
    @sample_skips_and_xfails(FOO_SKIPS_AND_XFAILS)
    # NB: the @ops decorator automatically filters out any rules that don't apply to this op
    def test_foo(self, device, dtype, op):
        for sample, subtest_ctx in op.sample_inputs(
            # NB: use_subtests=True is required for skips / xfails to work. If skips / xfails are defined and use_subtests != True,
            # an informative error will be thrown.
            device, dtype, requires_grad=False, use_subtests=True
        ):
            # NB: this subtest context manager runs each sample input as a "subtest" and handles skips / xfails appropriately
            with subtest_ctx(self):
                # do some SampleInput-based test logic
                output = op.op(sample.input, *sample.args, **sample.kwargs)
                ...
```

More examples can be seen in `test/test_nestedtensor.py`, where this system is used in practice.

I also demonstrate usage of syntactic sugar over this system in `test/functorch/test_vmap.py`. Here, a skip for the `to()` operator is replaced with a granular xfail for `test_vmap_exhaustive()`:
```python
...
# pre-existing xfail
xfail("item"),
# new granular xfail using syntactic sugar over the general system
xfailIf(
    "to",
    lambda sample: (
        sample.kwargs["memory_format"] == torch.channels_last
    ),
),
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140443
Approved by: https://github.com/janeyx99, https://github.com/zou3519
ghstack dependencies: #140160, #138370
2024-11-19 23:09:38 +00:00
cee3f8541e [MPS][BE] Use mtl_setBytes to upload bools as is (#141037)
Also add a static assert that the size of bool is a single byte, to guard against hard-to-debug corruption if someone decides to typedef it to int.

Fixes https://github.com/pytorch/pytorch/issues/140971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141037
Approved by: https://github.com/qqaatw, https://github.com/Skylion007
2024-11-19 23:08:43 +00:00
9fac5a16fd Revert "[PGNCCL] Add an API to get the status/error code of each PG (#140087)"
This reverts commit 80aa19a622bc6b159f7cf07b3501269f3356d752.

Reverted https://github.com/pytorch/pytorch/pull/140087 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140087#issuecomment-2486912231))
2024-11-19 22:53:46 +00:00
da069af0d4 [Easy] Refactor rsqrt lowering (#139944)
The bool/int casting is equivalent to register_pointwise_numeric

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139944
Approved by: https://github.com/shunting314, https://github.com/blaine-rister
2024-11-19 22:51:42 +00:00
496c1e78c5 Revert "Implements user buffer registration using MemPool (#133603)"
This reverts commit 25d9be37bef949c675e42b4929ddcb6997af2a7b.

Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))
2024-11-19 22:42:26 +00:00
32e93dfa92 [pytorch/profiler] Profiler NCCL metadata can now contain collective Input and Output Tensor addrs (#140637)
Summary: Studying memory access patterns is the primary use case.

Differential Revision: D65918359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140637
Approved by: https://github.com/briancoutinho
2024-11-19 22:22:16 +00:00
08cb5160b2 Extract reusable portions of GeluKernel into header (#140425)
Makes the implementation reusable via header-only code sharing. (no diff for that yet, but we can commit the refactor regardless.)

Testing: existing correctness tests should cover.

Differential Revision: [D65608800](https://our.internmc.facebook.com/intern/diff/D65608800/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140425
Approved by: https://github.com/ezyang
2024-11-19 22:00:01 +00:00
34e420519d [Reland] dont decompose baddbmm (#141045)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because eager does the epilogue entirely in fp32 without downcasting the bmm accumulator.
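
For reference, baddbmm computes `beta * input + alpha * bmm(batch1, batch2)`; a sketch of the two variants discussed (illustrative only, not the removed decomposition code):

```python
import torch

def baddbmm_upcast_all(inp, b1, b2, beta=1.0, alpha=1.0):
    # Old decomposition behavior: everything runs in fp32, slower than eager's fp16 bmm.
    out = beta * inp.float() + alpha * torch.bmm(b1.float(), b2.float())
    return out.to(inp.dtype)

def baddbmm_fp16_bmm(inp, b1, b2, beta=1.0, alpha=1.0):
    # Keeping the bmm in fp16 and upcasting only the epilogue: the fp16 accumulator has
    # already been downcast, so numerics diverge from eager's fp32 epilogue.
    return (beta * inp.float() + alpha * torch.bmm(b1, b2).float()).to(inp.dtype)
```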

Fix for https://github.com/pytorch/pytorch/issues/137897

Reland of https://github.com/pytorch/pytorch/pull/137904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141045
Approved by: https://github.com/BoyuanFeng
2024-11-19 21:07:58 +00:00
f30f43f594 Use std::bit_cast as c10::bit_cast if available (#141035)
Make what we're doing as obvious as possible to the compiler.

Differential Revision: [D66108811](https://our.internmc.facebook.com/intern/diff/D66108811/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141035
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566, #140567, #140720, #140994
2024-11-19 20:43:45 +00:00
f4ce9ac29d [dynamo] Dont erase the cache line on invalidation (#140821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140821
Approved by: https://github.com/jansel
2024-11-19 19:11:10 +00:00
efed02b990 Fix Half X86_F16 CUDA build failure (#140994)
It passed PyTorch CI, but internally we saw failures from this.

Differential Revision: [D66137897](https://our.internmc.facebook.com/intern/diff/D66137897/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140994
Approved by: https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566, #140567, #140720
2024-11-19 19:02:21 +00:00
4f2543c31d [logs] Add dynamo_timed to get better compilation time breakdown for AOTI (#140198)
Adding some dynamo_timed annotations to better understand AOTI compilation time.

Probably would require a few more passes. A lot of time is spent in Scheduler.__init__, and not enough annotations are there.

run_command_and_check takes a lot of time as well, but there is probably not much we can do there. Maybe we can add a config to tune the C++ optimization level?

traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">

Differential Revision: D65554141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
2024-11-19 18:54:17 +00:00
7f10351ba0 Revert "Implement deterministic scan (#140887)"
This reverts commit 4eed438a42a054a63b5e0a7225dd0e84cf488a96.

Reverted https://github.com/pytorch/pytorch/pull/140887 on behalf of https://github.com/ngimel due to breaks with 11.4 ([comment](https://github.com/pytorch/pytorch/pull/140887#issuecomment-2486409438))
2024-11-19 18:08:48 +00:00
d276688da6 Revert "[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)"
This reverts commit b09eb6ed6a22476746d8b7d5f6e464e34f89747a.

Reverted https://github.com/pytorch/pytorch/pull/139560 on behalf of https://github.com/anijain2305 due to internal test failures ([comment](https://github.com/pytorch/pytorch/pull/139560#issuecomment-2486344859))
2024-11-19 17:37:44 +00:00
7ced49d2cc Raise exception if vmap (eager) calls compiled function (#140439)
Fixes #138422

This is not a proper fix for #140439, but more of a way to prevent a user from seeing a nasty error inside the C++ code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140439
Approved by: https://github.com/zou3519
2024-11-19 16:27:48 +00:00
99a03211cb Deprecate conda nightly builds (#141024)
Removing CD as per https://github.com/pytorch/pytorch/issues/138506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141024
Approved by: https://github.com/malfet
2024-11-19 16:09:54 +00:00
2b21a653d8 Register CIA ops to FakeTensorMode directly in export (#140465)
During export, we stub out most CIA ops to return NotImplemented to avoid decomposing them during tracing. To recover the existing shape-propagation behavior, we also register these CIA decomps directly as FakeTensorMode rules. The reason we have to do this is that when we return NotImplemented, FakeTensor falls back to running these CIAs with the Meta backend, which causes device-branching CIA ops to fail (because the device is now Meta; one example is sdpa). If we register a kernel directly on FakeTensorMode, we don't fall back to the Meta backend.

Differential Revision: [D65716260](https://our.internmc.facebook.com/intern/diff/D65716260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140465
Approved by: https://github.com/bdhirsh
2024-11-19 15:00:35 +00:00
93aef684d9 fix typo in torch.compiler_dynamo_deepdive.rst (#140871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140871
Approved by: https://github.com/zou3519
2024-11-19 14:42:36 +00:00
260d1dcef4 Check torch.linalg.qr differentiability as documented (#135097)
Expands the `test_linalg_qr_autograd_errors` unit test to check all cases of differentiability/non-differentiability as given in the docs https://pytorch.org/docs/stable/generated/torch.linalg.qr.html:

- mode= ‘reduced’ (default): Returns (Q, R) of shapes (*, m, k), (*, k, n) respectively. It is always differentiable.
- mode= ‘complete’: Returns (Q, R) of shapes (*, m, m), (*, m, n) respectively. It is differentiable for m <= n.
- mode= ‘r’: Computes only the reduced R. Returns (Q, R) with Q empty and R of shape (*, k, n). It is never differentiable.

(in particular, the happy paths are added)
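
A small illustration of the documented behavior being covered (a sketch; the actual test asserts the specific error messages):

```python
import torch

a = torch.randn(4, 3, requires_grad=True)

q, r = torch.linalg.qr(a, mode="reduced")
r.sum().backward()  # fine: 'reduced' is always differentiable

q, r = torch.linalg.qr(a, mode="r")  # Q comes back empty
try:
    r.sum().backward()
except RuntimeError as e:
    print("not differentiable:", e)
```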

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135097
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved
2024-11-19 12:25:39 +00:00
0c7c5d78fa [inductor] add support for TRITON_INTERPRET (#140841)
Was debugging the issue lower in the stack and found this to be helpful / quick enough to add.

Fix for https://github.com/pytorch/pytorch/issues/123956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140841
Approved by: https://github.com/exclamaforte
2024-11-19 11:24:13 +00:00
f0f6144381 [EZ][BE] Update googletest submodule (#140988)
From v1.11.0 (released in Jun 2021) to v1.15.2 (released in Jul 2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140988
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-11-19 07:49:16 +00:00
808da50c2d create a new torch.cuda.device_memory_used api (#140870)
Summary:
The current torch.cuda.memory_usage returns memory utilization; more specifically, for Nvidia it is the percent of time over the past sample period during which global memory was being read or written.
see more details in https://github.com/pytorch/pytorch/issues/140638
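
Assuming the new helper follows the pattern of other `torch.cuda` query functions (optional device argument, value in bytes), usage would look roughly like:

```python
import torch

if torch.cuda.is_available():
    dev = torch.cuda.current_device()
    used_bytes = torch.cuda.device_memory_used(dev)  # new: bytes of device memory in use
    util_pct = torch.cuda.memory_usage(dev)          # existing: % of time memory was read/written
    print(used_bytes, util_pct)
```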

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel, https://github.com/eqy
2024-11-19 06:36:30 +00:00
7156d0824d [ROCm] Fix largeIndexBlockSize (#139087)
On ROCm, hipification converts std::min to ::min, but ::min does not return the right result. This impacts the index_add_ operation on a large tensor: we end up picking the large value instead of the max supported block size (128), which leads to the GPU accessing memory out of bounds.

While we wait for ::min to be fixed, we can compare with the < operator instead of relying on ::min.

Example Code w/ failure:
```
D=6144
hidden_states = torch.zeros([16384, 6144],           device="cuda:0", dtype=torch.bfloat16)
index         = torch.randint(0, 16384, (1, 32, 16384), device="cuda:0", dtype=torch.int64)
output        = torch.empty([1, 32, 16384, 6144],    device="cuda:0", dtype=torch.bfloat16)
hidden_states.index_add_(0, index.view(-1), output.view(-1, D))
```

```
Traceback (most recent call last):
RuntimeError: HIP error: invalid configuration argument
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2024-11-19 06:29:48 +00:00
115f15a255 [PGNCCL][EZ] Do not use same name as NCCL API (#140997)
`ncclCommAbort` is an API name of NCCL. Do not use the same name for `NCCLComm`'s method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140997
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-11-19 05:40:39 +00:00
1bdb9ddc70 [CD] Upgrade XPU support packages version to 2025.0 (#140373)
Depends on https://github.com/pytorch/pytorch/pull/139775
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140373
Approved by: https://github.com/atalman, https://github.com/malfet
2024-11-19 05:16:46 +00:00
8bc4033814 [fr][ez] better log messages + minor fixups (#140969)
Summary:
1. Clearly specify in error messages that we are referring to a collective_sequence_id and an internal_record id for an entry.
The entry id is semi-useless for the end consumer, so at least let them know that it is an internal record id.
2. Add some missing fields in types.py.
  self.missing_ranks = set()
  self.input_numel = tuple()
  self.output_numel = tuple()
  self.errors = set()

These were showing up as linter errors when I opened the file in vs-code

Test Plan:
```
buck2 run //caffe2/fb/flight_recorder:fr_trace -- -m f665492593-nerf_training-96ab95e0 -w 8 --mast_job_version 0 -a 0
Buck UI: https://www.internalfb.com/buck2/2cac9273-1b7b-47bf-867f-82f9a4c1d581
Network: Up: 0B  Down: 0B
Not all ranks joining collective: sequence number: 31117
internal record id: 31116
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {3, 4, 5, 6, 7}
input sizes: [[1571911]]
output sizes: [[1571911]]
world size: 8
expected ranks: {0, 1, 2, 3, 4, 5, 6, 7}
collective state: scheduled
collective stack trace:
 all_reduce at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/distributed_c10d.py:2707
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/c10d_logger.py:81
sync_buffers at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/models/gaussian_splatting.py:650
decorate_context at /packages/fblearner.flow.canary/workflow#link-tree/torch/utils/_contextlib.py:116
step at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/training/training_manager/splatting.py:356
main at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/nerf_training.py:260
main_impl at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:57
main at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:34
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py:355
<module> at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:118
_run_code at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:86
_run_module_as_main at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:196
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/bootstrap.py:69
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/meta_only/bootstrap.py:98
__invoke_main at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:28
<module> at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:31

...

Differential Revision: D66018461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140969
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-11-19 04:39:16 +00:00
51d4338716 fix test_save_load_transform. (#140494)
Fixes test_save_load_transform in [test_transforms.py](https://github.com/pytorch/pytorch/blob/main/test/distributions/test_transforms.py).

_pytest test_transforms.py -k test_save_load_transform_

error message:

```
.
.
.
  File "/workspace/pytorch/test/distributions/test_transforms.py", line 555, in test_save_load_transform
    other = torch.load(stream)
            ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/serialization.py", line 1444, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
	(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL torch.distributions.transformed_distribution.TransformedDistribution was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TransformedDistribution])` or the `torch.serialization.safe_globals([TransformedDistribution])` context manager to allowlist this global if you trust this class/function.

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140494
Approved by: https://github.com/mikaylagawarecki
2024-11-19 04:36:06 +00:00
d472a5f680 Revert "[inductor] Refactor MutableBox to make IRNode typing easier (#140895)"
This reverts commit c79e78b5034198f9d6801b4fef710b9b9b0e9193.

Reverted https://github.com/pytorch/pytorch/pull/140895 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think test_torchbind_inductor is failing in trunk after this lands ([comment](https://github.com/pytorch/pytorch/pull/140895#issuecomment-2484679319))
2024-11-19 04:25:41 +00:00
cyy
00b3b61076 Add and use thread-safe strerror (#140472)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140472
Approved by: https://github.com/ezyang
2024-11-19 04:24:17 +00:00
a10ce22577 [BE] Update bazelisk and bazel versions (#140992)
bazelisk from 1.16 to 1.23
bazel from 6.1.1 to 6.5.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140992
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-11-19 03:40:53 +00:00
0fcd024f59 [hop] refactor only_consist_of with find_mismatched_vars (#140105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140105
Approved by: https://github.com/zou3519
2024-11-19 03:21:16 +00:00
70a0906f24 [c10d] Support optional backend if device_id provided (#140963)
Citing @malfet's [comment](https://github.com/pytorch/pytorch/pull/136343#pullrequestreview-2318792396) in https://github.com/pytorch/pytorch/pull/136343
> It would be great, if users do not have to modify their programs for every new backend, but rather use with torch.device('xpu'): and keep rest of the code unchanged.

This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (accepting `device_id` was previously supported for the purpose of eager init).

New user experience:
```
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```

The line `device = torch.device(...)` is needed anyway, because the user would use it for tensor creation etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140963
Approved by: https://github.com/wconstab
2024-11-19 03:17:29 +00:00
37959c554d Add small test case for #140230 (#140850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140850
Approved by: https://github.com/malfet
ghstack dependencies: #140739, #140740
2024-11-19 02:44:54 +00:00
f3f305ef3e Fix condition for weights_only unpickler for DTensor (#140740)
Same as #140739 but for DTensor (move safe globals for DTensor to `torch.distributed.tensor.__init__` and update error message to let user know `torch.distributed.tensor` must be imported to load DTensor)

Differential Revision: [D65961690](https://our.internmc.facebook.com/intern/diff/D65961690)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140740
Approved by: https://github.com/malfet
ghstack dependencies: #140739
2024-11-19 02:44:53 +00:00
b63a84804c Allow NJT by default for weights_only torch.load (take 2) (#140739)
Per discussion with @malfet, only allow weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo`  are imported

(this is slightly weird as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is actually what needs to be imported)

we can't import this from `torch.nested` as this would
- undo dynamo lazy import
- cause circular import

===========================
Redo of https://github.com/pytorch/pytorch/pull/140304 caused issues as `torch.nested._internal.foo` needs to be imported, which causes issues like

```python
torch/_weights_only_unpickler.py", line 339, in load
    if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
    torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```

**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time**

Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)

@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
2024-11-19 02:44:53 +00:00
1e234e63b3 [pytorch][dynamo_compile] Log inductor config to dynamo_compile (#140790)
Summary:
Scrubbed inductor config logging to dynamo_compile as json:str.

Scrub RE: `r'((^TYPE_CHECKING$)|(.*_progress$)|(.*TESTING.*)|(.*(rocm|halide).*)|(^trace\..*)|(^_))'` to save some space.
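A small illustrative sketch (the config keys below are invented) of how keys matching the scrub regex above can be filtered out before the config is serialized:

```python
import json
import re

SCRUB_RE = re.compile(
    r"((^TYPE_CHECKING$)|(.*_progress$)|(.*TESTING.*)|(.*(rocm|halide).*)|(^trace\..*)|(^_))"
)

# Illustrative config dict; the real inductor config has many more keys.
config = {"max_autotune": True, "trace.enabled": False, "_save_config_ignore": [], "rocm.use_bf16": False}
scrubbed = {k: v for k, v in config.items() if not SCRUB_RE.match(k)}
print(json.dumps(scrubbed))  # {"max_autotune": true}
```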

Test Plan:
Staging logger: https://fburl.com/data/ltkt08zm

P1679697917

{F1958428018}

Differential Revision: D65806399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140790
Approved by: https://github.com/masnesral
2024-11-19 02:39:33 +00:00
9ae19ffbed fix layer_norm decomp precision for cpu (#140557)
xref: https://fb.workplace.com/groups/1075192433118967/posts/1540519826586223/?comment_id=1543752356262970&reply_comment_id=1544425069529032

The issue is that our decomp needs to branch on device (it only upcasts for CPU), but the device shows up as "meta" because it is registered as a meta tensor rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140557
Approved by: https://github.com/ezyang
2024-11-19 02:31:31 +00:00
240aa77ad0 [Quantizer][XNNPACK] Fix ReLU fusion when conv/linear has > 1 user (#140846)
Summary:
Bug in the quantizer: Conv + ReLU is fused even when the preceding conv has more than one user. Conv and ReLU cannot be fused in this case because the result of Conv must also be used elsewhere.

XNNPACK Delegate naturally handles this by inserting a clamp node for ReLU.
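A minimal illustrative module (not taken from the PR) showing the pattern: the conv output has a second user besides the ReLU, so Conv + ReLU must not be fused:

```python
import torch

class ConvWithTwoUsers(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=1)

    def forward(self, x):
        y = self.conv(x)
        # `y` has two users: the ReLU and the residual add, so fusing Conv+ReLU
        # would lose the un-activated result needed by the second user.
        return torch.relu(y) + y
```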

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D65989599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140846
Approved by: https://github.com/digantdesai
2024-11-19 02:29:45 +00:00
2673a440d0 [distributed] add PG APIs and general doc cleanups (#140853)
Doc updates:

* This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71 .
* It also does some general cleanups to simplify the distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores
* I've reordered things so the collective APIs are before the Store/PG apis

Test plan:

```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
2024-11-19 02:06:32 +00:00
5b326d6b61 Add gdb print methods support same as pytorch-lldb (#140935)
`pytorch-lldb` support pretty printing size and key_set of tensor via #97101

Add same pretty printing for gdb debugging.

**Test Result**

```bash
$ gdb python
(gdb) break at::native::negative
(gdb) r
>>> import torch
>>> t = torch.tensor([1, 2, 3, 4], dtype=torch.float64)
>>> t.negative()
Thread 1 "python" hit Breakpoint 1, at::native::negative (self=...) at /home/zong/code/pytorch/aten/src/ATen/native/UnaryOps.cpp:854
854	Tensor negative(const Tensor& self) { return self.neg(); }
```

**Before**
```bash
(gdb) p self.key_set()
$2 = {repr_ = 1271310352385}

(gdb) p self.sizes()
$3 = {Data = 0x9cb488, Length = 1}

```

**After**
```bash
(gdb) torch-int-array-ref-repr self.sizes()
[4]
(gdb) torch-dispatch-keyset-repr self.key_set()
DispatchKeySet(CPU, ADInplaceOrView, AutogradCPU, AutocastCPU)
```

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/b720e284-13b1-4581-ae3a-963f6482fdb2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140935
Approved by: https://github.com/drisspg
2024-11-19 01:28:30 +00:00
98e6e69b1b [C10D] Support group_dst/group_src in c10d send/recv object_list (#140847)
Also add mypy annotations

Partially addresses RFC 0042 (https://github.com/pytorch/rfcs/pull/71)
See more details/motivation in https://github.com/pytorch/pytorch/pull/140460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140847
Approved by: https://github.com/H-Huang
ghstack dependencies: #140843
2024-11-19 01:23:08 +00:00
c82c46ccc7 [C10D] support group_src/dst in broadcast/reduce ops (#140843)
Also add mypy annotations

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140843
Approved by: https://github.com/kwen2501
2024-11-19 01:23:08 +00:00
efe8482c0d Add prepare_obs_or_fq_callback to quantizer (#140863)
Test Plan: CI.

Differential Revision: D65982003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140863
Approved by: https://github.com/jerryzh168
2024-11-19 01:13:38 +00:00
c79e78b503 [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-19 00:24:35 +00:00
98e441f00b [dynamo] Simplify ConstantVariable.create and ConstantVariable.__init__ (#140745)
This patch removes some redundant code paths in
`ConstantVariable.create` and` ConstantVariable.__init__`.

Closes #110871.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140745
Approved by: https://github.com/jansel
2024-11-19 00:22:50 +00:00
2da98d9757 [dynamo] Support is comparison for symnodes (#140754)
Fixes #109504.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140754
Approved by: https://github.com/williamwen42
2024-11-19 00:19:33 +00:00
175ba9fed6 [Utilization Monitor] input to disable utilization monitor (#140857)
# Overview
Currently monitor.py produces error-only results, so this PR introduces a disable-monitor option in all *-test.yml workflows. We would also like to explore how the monitor code affects benchmark results.

# next steps
- fix the monitor.py
- enable non-benchmark tests with monitor
- investigate benchmark test behavior with monitor background job

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140857
Approved by: https://github.com/huydhn
2024-11-18 23:26:03 +00:00
48a276c5a0 log_softmax: fix meta function output argument dtype check. (#140289)
Tracking issue: #138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140289
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286, #140288
2024-11-18 23:05:29 +00:00
435286e985 Fix unary references' out dtype check. (#140288)
Tracking issue: #138399

This PR fixes a number of reference implementations (which are also used as meta
functions), making them more consistent with CPU device. More specifically, it fixes those
operations that use `_make_elementwise_unary_reference` decorator, and don't error on
mismatching out argument dtype while they error when using concrete devices (e.g. CPU).

The fixed operations are:

- `abs`
- `ceil`
- `floor`
- `frac`
- `isneginf`
- `isposinf`
- `sgn`
- `sign`
- `signbit`
- `trunc`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140288
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286
2024-11-18 23:05:29 +00:00
727f1a6da9 Revert "FlopCounterMode: Decompose ops for inference mode (#138508)"
This reverts commit f915409c26c0ba38b286c7b617880af61a6b08ba.

Reverted https://github.com/pytorch/pytorch/pull/138508 on behalf of https://github.com/jamesjwu due to Failing internal jobs ([comment](https://github.com/pytorch/pytorch/pull/138508#issuecomment-2484310587))
2024-11-18 22:59:36 +00:00
8d5b3eeaa6 Remove __start__ stack, log backward compile to empty stack (#140431)
Summary:
This diff removes "__start__" from all stacks in Pt2 Compile Events, as it's unnecessary.

It also starts logging events for backward compile, because otherwise we have no toplevel event representing full backward compilation. This gives us a toplevel event outside of the inductor compile.

Test Plan:
New chromium events:

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fstuff4%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fstuff4%2Fchromium_events.json&local_cache_key

New tlparse:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/jjwu/custom/stuff4/index.html

New scuba icicle view, still good: https://fburl.com/scuba/pt2_compile_events/z6gr3z53

Differential Revision: D65832045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140431
Approved by: https://github.com/masnesral
2024-11-18 22:48:31 +00:00
8e439021c1 [ONNX] Support from dynamic_shapes to dynamic_axes when torch.onnx.export(fallback=True) is triggered (#139532)
Fixes #139320

### Summary:
#### (1) Add  `_rename_dynamic_shapes_with_model_inputs` for dynamic_shapes to play along with input_names

* Use the model forward signature to rename dynamic_shapes when dynamic_shapes is not nested and directly uses customized names. This solves the issue that torch.export.export expects dynamic_shapes to use only the model input names.
* If the dynamic_shapes is nested, we do nothing.

#### (2) Add `_from_dynamic_shapes_to_dynamic_axes` for fallback

* We flatten dynamic_shapes with leaf defined _pytree.tree_leaves()
~~* If a dynamic_shapes is not nested, and defined in dict. We can use the key as the input_names, since it should be renamed by `_rename_dynamic_shapes_with_model_inputs` already.~~
* If a dynamic_shapes is provided, input_names is required to assign the names, because dynamic_axes needs it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139532
Approved by: https://github.com/justinchuby
2024-11-18 22:35:21 +00:00
72943ba823 [3.13] deal with exec() semantic change in test_cond_no_dynamo_cache_limit (#140401)
https://peps.python.org/pep-0667/ changed the semantics of `eval/exec` in 3.13 so that changes to locals no longer propagate (but globals do). This is to make the behavior predictable since in the past, the locals may or may not update based on various mysterious conditions. Other test sites may need updating too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140401
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-11-18 22:06:47 +00:00
e445239bb4 [ONNX] Fix 2GB exporting crash during onnx shape type inference (#140962)
Fixes https://github.com/pytorch/pytorch/issues/132205

A regression after https://github.com/pytorch/pytorch/pull/128675 caused ONNX shape type inference errors to stop the exporting process during shape type inference. ONNX shape type inference during export only does its best to fill in the information, and should not crash the export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140962
Approved by: https://github.com/justinchuby
2024-11-18 21:50:23 +00:00
cyy
8cd7ad8b48 [Reland][Environment Variable][5/N] Use thread-safe getenv functions (#140594)
Reland of #139762 with no bug found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140594
Approved by: https://github.com/ezyang
2024-11-18 21:45:35 +00:00
c62da98c1a Upload all run attempts when in upload_test_stats_intermediate (#140459)
Upload all run attempts since it can be hard to determine which run attempt to do from HUD, since HUD shows everything together
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140459
Approved by: https://github.com/huydhn
2024-11-18 21:40:10 +00:00
17bb78a3d3 Port X86_F16 from executorch half to PyTorch half (#140720)
This was added in https://github.com/pytorch/executorch/pull/1789 . I'm working on sharing Half.h with ExecuTorch, and this is a missing feature.

Differential Revision: [D65949409](https://our.internmc.facebook.com/intern/diff/D65949409/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140720
Approved by: https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566, #140567
2024-11-18 21:32:44 +00:00
43de32d948 Revert "create a new torch.cuda.device_memory_used api (#140870)"
This reverts commit 478204cad68651960a979ca109e2bd4a219b0f1a.

Reverted https://github.com/pytorch/pytorch/pull/140870 on behalf of https://github.com/yuguo68 due to the test is still flaky on ROCm, test_cuda.py::TestCudaMallocAsync is not skipped with the unittest.skipIf(TEST_CUDAMALLOCASYNC ([comment](https://github.com/pytorch/pytorch/pull/140870#issuecomment-2484161914))
2024-11-18 21:26:25 +00:00
4bb1bf0573 [Docs] Remove duplicate declaration of double_tensor (#140927)
Fixes #140920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140927
Approved by: https://github.com/malfet
2024-11-18 21:22:30 +00:00
e46af7de0c [MPS] [BE] Use direct call vs virtual (#140950)
I.e. replace `at::detail::getMPSHooks().isOnMacOSorNewer` with `is_macos_13_or_newer`, which is a direct function call instead of going through a virtual method call.
Hooks are only needed to provide a feature-agnostic interface to query something even on platforms that might not have support for the feature, while functions implemented in `ATen/native/xxx` should be able to call those platform-specific methods directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140950
Approved by: https://github.com/Skylion007
ghstack dependencies: #140896
2024-11-18 21:01:52 +00:00
4eed438a42 Implement deterministic scan (#140887)
Fixes #89492
Uses block-wise cub primitives
On large inputs, this implementation is approximately 25% slower than the device cub implementation, so it's turned on only in the cases where cub would have been used (floating-point inputs, cumsum that is effectively 1d).
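An illustrative example of the kind of call the new path covers per the description above (floating-point input, cumsum that is effectively 1d); a CUDA device is assumed:

```python
import torch

x = torch.randn(1 << 20, device="cuda")
y1 = x.cumsum(dim=0)
y2 = x.cumsum(dim=0)
# A deterministic scan guarantees repeated calls give bitwise-identical results.
print(torch.equal(y1, y2))
```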

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140887
Approved by: https://github.com/ezyang, https://github.com/kurtamohler
2024-11-18 20:56:14 +00:00
00c829876c Log Full Knapsack Problem Information (#140757)
Summary: When AOT_PARTITIONER_DEBUG is set to 1 and debug logging is turned on we can now log the full input and output for each knapsack problem.

Differential Revision: D65633086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140757
Approved by: https://github.com/jansel
2024-11-18 20:36:32 +00:00
408ad45014 [MPS][BE] Introduce mtl_setArgs (#140896)
Which is a variadic template that automates the tedious (and error-prone) process of passing the arguments via a series of
```cpp
  mtl_setBuffer(encoder, b1, 0);
  mtl_setBuffer(encoder, b2, 1);
  mtl_setBytes(encoder, param, 2);
```
into a compact
```
  mtl_setArgs(encoder, b1, b2, param);
```

Introduce a few more specializations of `mps_setArg`, such as:
 - Call `setBuffer` for `id<MTLBuffer>`
 - Copy double as float (as MPS does not support double precision types)
 - Accept `std::optional<at::Tensor>` that will not call setBuffer if the optional is empty

Also, re-metaprogram `mtl_setBytes` to make it usable with any trivially copyable struct, but keep a separate implementation for containers, as uploading `c10::SmallVector` (which is trivially copyable) would overwrite the next arguments; this luckily resulted in test failures of `test_cross_entropy_label_smoothing_weight_ignore_indices_mps`

Introduce `has_size_type_v`, which can be used to differentiate trivially copyable `std::array` and `c10::ArrayRef` from other trivially copyable structs.
```cpp
template <typename T>
class has_size_type {
  template <typename U>
  static constexpr std::true_type check(typename U::size_type*);
  template <typename>
  static constexpr std::false_type check(...);

 public:
  static constexpr bool value = decltype(check<T>(nullptr))::value;
};

template <typename T>
constexpr bool has_size_type_v = has_size_type<T>::value;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140896
Approved by: https://github.com/Skylion007
2024-11-18 20:35:01 +00:00
e80b1b2870 Flex + NJT: cross attention support (#140723)
Fixes #140598

Allows ragged structures for query and key+value sequence lengths to differ (i.e. supports cross attention for Flex + NJT).

Technically, this is BC-breaking thanks to arg renaming and positional arg reordering in `create_nested_block_mask()`, but Flex + NJT support isn't in a major release yet so I'm hoping we can just do it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140723
Approved by: https://github.com/drisspg
2024-11-18 19:49:45 +00:00
478204cad6 create a new torch.cuda.device_memory_used api (#140870)
Summary:
The current torch.cuda.memory_usage returns memory utilization; more specifically, the percent of time over the past sample period during which global memory was being read or written (for NVIDIA GPUs).
see more details in https://github.com/pytorch/pytorch/issues/140638

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel
2024-11-18 19:13:43 +00:00
081c1687c8 Remove UB type punning from c10/util/floating_point_utils.h (#140567)
Accessing the inactive member of a union is undefined behavior. Fortunately, we have c10::bit_cast.

Differential Revision: [D65888680](https://our.internmc.facebook.com/intern/diff/D65888680/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140567
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566
2024-11-18 18:41:34 +00:00
f59ec98ceb Add C10_EMBEDDED to gate ostream usage in Half/BFloat16 (#140566)
We want to use Half/BFloat16 in ExecuTorch to support shared kernel code. They will need to be used in ExecuTorch core, so they can't have streams. This diff introduces a macro to gate the stream code off.

Differential Revision: [D65888035](https://our.internmc.facebook.com/intern/diff/D65888035/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140566
Approved by: https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #140564, #140565
2024-11-18 18:41:34 +00:00
0f1a88cfba Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-11-18 18:21:17 +00:00
cca34be584 Update XNNPACK Version (#139913)
Updating XNNPACK Version to 4ea82e595b36106653175dcb04b2aa532660d0d8

submodule update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139913
Approved by: https://github.com/digantdesai, https://github.com/huydhn
2024-11-18 18:16:31 +00:00
e429a3b72e Move complex<Half> from Half.h to complex.h (#140565)
Executing on old TODO on the way to sharing Half.h with ExecuTorch.

Differential Revision: [D65888037](https://our.internmc.facebook.com/intern/diff/D65888037/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140565
Approved by: https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #140564
2024-11-18 15:56:21 +00:00
f630799587 move c10::overflows to its own header (#140564)
Working on moving `complex<Half>` to complex.h instead of Half.h; this depends on complex and isn't used particularly widely.

Differential Revision: [D65888038](https://our.internmc.facebook.com/intern/diff/D65888038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140564
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/malfet
2024-11-18 15:56:21 +00:00
b379a28a95 Generalization of distributed test cases for non-CUDA devices (#138216)
# Motivation
This PR is an extension of #131758. As described in #131758, these changes aim to make distributed UTs more accessible to users of all device types.

It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784).

This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstracts out the process group creation /deletion and other functionality for a given device.

The new generalized content can be added by deriving from this base class.
Also includes other misc changes for gaudi support

The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.

The following changes have been made to test_functional_api.py:
- Functionality has been added to test for non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow for general calls to accelerators, adding test skips instead of explicitly skipping for multiple GPUs
- Skipifhpu flags have been added to enable skipping a few multithreaded test cases which are not yet supported on HPUs

NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator.

I will be raising a separate PR to improve usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.

This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 here as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2024-11-18 09:38:00 +00:00
cyy
06dde8c157 [1/N] Remove inclusion of ATen/core/Array.h (#122064)
The functionality of Array.h is largely overlapped with std::array and it should be safe to use std::array instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122064
Approved by: https://github.com/ezyang
2024-11-18 08:50:28 +00:00
6c6f745fa7 Revert "[1/N] Remove inclusion of ATen/core/Array.h (#122064)"
This reverts commit 486b9aaa67a02807aea06f33c009b5311caab337.

Reverted https://github.com/pytorch/pytorch/pull/122064 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of compilation errors show up after this lands ([comment](https://github.com/pytorch/pytorch/pull/122064#issuecomment-2482263396))
2024-11-18 08:31:38 +00:00
43edb94f8a [Quantization][PrivateUse1] Adding more support QuantizedPrivateuse1 backends (#139860)
Here's are some explanations of this PR.

1. Changes in `aten/src/ATen/core/Tensor.cpp` and `c10/core/DispatchKey.cpp`: support the toString method for the `QuantizedPrivateUse1` backend, making PyTorch print out the correct backend string for it.
2. Add header `DispatchStub.h` in `aten/src/ATen/native/quantized/IndexKernel.h`: if this header is not included, we can't use `masked_fill_kernel_quantized_stub` even if we include `IndexKernel.h`; it would throw an error during compilation.
3. Add multiple `TORCH_API`s in `aten/src/ATen/native/quantized/AffineQuantizer.h`: these functions are useful for other PrivateUse1 backends supporting quantization; if these `TORCH_API`s are missing, it would throw an error at runtime (undefined symbol).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139860
Approved by: https://github.com/bdhirsh
2024-11-18 05:09:59 +00:00
1d5a8ee8fb [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.
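A rough sketch of the "happy middle ground" described above; the exception types and exact placement are illustrative, not the actual test-suite code:

```python
import unittest
import torch.distributed as dist

class MultiProcessTestCaseSketch(unittest.TestCase):
    def tearDown(self):
        super().tearDown()
        try:
            dist.destroy_process_group()
        except (AssertionError, ValueError):
            # The test may have already destroyed its group, or never created one.
            pass
```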

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
2024-11-18 04:26:21 +00:00
a1327fac45 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [5/N] (#140663)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140663
Approved by: https://github.com/williamwen42
2024-11-18 04:11:56 +00:00
16bc82a015 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [6/N] (#140688)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140688
Approved by: https://github.com/williamwen42
2024-11-18 04:09:09 +00:00
62d2c5b667 Revert "Enable XPUEvent elapsed_time function (#134666)" (#140872)
# Motivation
This PR raises an internal UT failure on XPU.
This reverts commit 4bbd6da33101a8d709f1d2921ad8ae6f9b0dc166.
# Additional Context
refer to https://github.com/pytorch/pytorch/issues/140814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140872
Approved by: https://github.com/EikanWang
2024-11-18 02:58:05 +00:00
3d26c08dda Fix unintended deprecation warning in torch.distributed.optim (#140889)
We have a deprecation warning for scripted functional optimizer at module level in `torch/distributed/optim/__init__.py`. However, not all optimizers exposed by the module are scripted functional optimizers, causing some false deprecation warning (e.g. https://github.com/pytorch/pytorch/issues/139661).

This PR moves the deprecation warning to the `__init__` functions of the deprecated scripted functional optimizers.
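An illustrative sketch of the change (the class name and warning text are placeholders): the warning is emitted only when a deprecated scripted functional optimizer is constructed, not when the module is imported:

```python
import warnings

class _ScriptedFunctionalOptimizer:  # placeholder for a deprecated optimizer class
    def __init__(self, params, lr=0.01):
        # Emit the deprecation warning at construction time instead of at import time.
        warnings.warn(
            "TorchScript-based functional optimizers are deprecated",
            FutureWarning,
            stacklevel=2,
        )
        self.params = list(params)
        self.lr = lr
```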

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140889
Approved by: https://github.com/d4l3k, https://github.com/kwen2501, https://github.com/XilunWu
2024-11-18 02:34:51 +00:00
137554c943 [CI] Upgrade XPU support packages version to 2025.0 (#139775)
Works for #139722 and #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139775
Approved by: https://github.com/atalman
2024-11-18 02:26:13 +00:00
cyy
486b9aaa67 [1/N] Remove inclusion of ATen/core/Array.h (#122064)
The functionality of Array.h is largely overlapped with std::array and it should be safe to use std::array instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122064
Approved by: https://github.com/ezyang
2024-11-18 01:31:39 +00:00
c3fbec74bd [PT2][Optimus] Fix a corner case in merge splits (#140788)
Summary:
We observed another corner case where not all split items are used, see the screenshot

{F1960315622}

We thus skip such cases by checking the getitem indices.

Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split  --flow_id 663157369 2>&1 | tee ~/cmf.txt
```
P1679677122

# E2E

before fix
f663157369

after fix

Differential Revision: D65990213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140788
Approved by: https://github.com/jackiexu1992
2024-11-18 01:27:43 +00:00
625c24a7f9 [C10D] Support group_dst in scatter/gather (+object) ops (#140827)
Also add missing mypy typing and a few asserts to make mypy happy

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460

Note: the object collective version canonicalizes to global instead of group rank, simply because this left more of the original code intact and required fewer conversions overall.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140827
Approved by: https://github.com/kwen2501
2024-11-17 22:19:58 +00:00
99014a297c [BE][MPS] Apply clang-format to mps headers (#140906)
It was a mistake to have missed them in the past.

All changes in this PR except ones to .lintrunner.toml are generated by running
`lintrunner -a --take CLANGFORMAT --all-files`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140906
Approved by: https://github.com/Skylion007
2024-11-17 21:06:27 +00:00
5a7e147ef3 [SymmetricMemory] introduce user-facing APIs empty() and rendezvous() (#139677)
Previously `SymmetricMemory` only had private pybind APIs:
```python
from torch.distributed._symmetric_memory import _SymmetricMemory
t = _SymmetricMemory.empty_strided_p2p(
    size=(64,),
    stride=(1,),
    dtype=torch.float32,
    device=device,
)
symm_mem_hdl = _SymmetricMemory.rendezvous(t, group_name=group.group_name)
```

This PR introduces user-facing APIs empty() and rendezvous():
```python
import torch.distributed._symmetric_memory as symm_mem
t = symm_mem.empty(64, device="cuda")
symm_mem_hdl = symm_mem.rendezvous(t, group_name=group.group_name)
```

Notable differences compared to the pybind APIs:
- `empty()` now resembles `torch.empty()`:
  - shape can either be an integer sequence or pack
  - no need to/can't specify stride anymore
  - device can either be `torch.device` or string
- `group_name` needs to be specified at rendezvous time as opposed to allocation time. See https://github.com/pytorch/pytorch/pull/139529 for the rationales. I feel the new semantic is superior, hence enforcing it in the public API.
  - Currently, the pybind API still support specifying `group_name` at rendezvous time.

This PR does not change the behavior of the pybind APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139677
Approved by: https://github.com/lw
ghstack dependencies: #139529
2024-11-17 20:51:50 +00:00
9f4af6b4e6 Add trunc to z3 validator (#140886)
Fixes vision_maskrcnn benchmark when validation is turned on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140886
Approved by: https://github.com/ezyang
ghstack dependencies: #140830, #140832, #140828
2024-11-17 18:38:30 +00:00
9005156004 don't specialize when grad tracking tensors are activated (#140828)
Fixes `python test/dynamo/test_inline_inbuilt_nn_modules.py
InlineInbuiltNNModulesFuncTorchHigherOrderOpTests.test_grad_non_tensor_input_inline_inbuilt_nn_modules`
when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140828
Approved by: https://github.com/ezyang
ghstack dependencies: #140830, #140832
2024-11-17 18:28:47 +00:00
e1d6c08f3d Specialize symfloats when getting fake value involves complex args (#140832)
Fixed `PYTORCH_TEST_WITH_DYNAMO=1 tlp python test/test_sparse_csr.py TestSparseCSRCPU.test_sampled_addmm_cpu_complex64` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140832
Approved by: https://github.com/ezyang
ghstack dependencies: #140830
2024-11-17 18:17:54 +00:00
24be47f0c7 [MPS] Allow >2**32 metal dispatches (#140862)
By passing length as `NSUInteger` which should be a 64-bit value on all 64-bit systems according to https://developer.apple.com/documentation/objectivec/nsuinteger?language=objc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140862
Approved by: https://github.com/Skylion007
2024-11-17 18:05:44 +00:00
4269250a30 [BE][EZ] Use nested namespaces (#140905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140905
Approved by: https://github.com/Skylion007
2024-11-17 17:53:00 +00:00
cyy
73602873c9 [10/N] Fix Wextra-semi warning (#140880)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140880
Approved by: https://github.com/ezyang
2024-11-17 16:12:28 +00:00
2c6bd9f6f6 [inductor] Support fixed triton configs defined at compile time (#140217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140217
Approved by: https://github.com/shunting314
ghstack dependencies: #139585
2024-11-17 16:10:37 +00:00
318eaa2be7 [inductor] Refactor reduction type choices into V.choices (#139585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139585
Approved by: https://github.com/shunting314
2024-11-17 16:10:37 +00:00
44afaac9fd [MPS][BE] Fix non-portable path warning (#140891)
I.e. fixes
```
1082/1084] Building OBJCXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mps/operations/UpSample.mm.o
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/UpSample.mm:224:10: warning: non-portable path to file '<ATen/native/mps/UpSample_metallib.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
  224 | #include <ATen/native/mps/Upsample_metallib.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |          <ATen/native/mps/UpSample_metallib.h>
```
as generated header name should have the same capitalization as respective shader file, i.e. `kernels/UpSample.metal`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140891
Approved by: https://github.com/Skylion007
2024-11-17 15:14:05 +00:00
90d3584147 [dynamo] support subclasses of namedtuple type (#140534)
Allow subclassing namedtuple types. Allow assigning attributes to instances of these subtypes.
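A minimal sketch of the pattern this enables, per the description above (the compiled function is illustrative):

```python
import collections
import torch

Point = collections.namedtuple("Point", ["x", "y"])

class LabeledPoint(Point):
    # No __slots__, so instances get a __dict__ and support attribute assignment.
    pass

@torch.compile(backend="eager")
def fn(t):
    p = LabeledPoint(t, t * 2)
    p.label = "doubled"  # attribute assignment on a namedtuple-subclass instance
    return p.x + p.y

print(fn(torch.ones(3)))
```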

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140534
Approved by: https://github.com/jansel
2024-11-17 14:13:40 +00:00
ab5c8857ef [SymmetricMemory] support specifying group_name at rendezvous time (#139529)
Before this PR, users need to call `empty_strided_p2p()` with a `group_name`:

```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0")
symm_mem = _SymmetricMemory.rendezvous(tensor)
```

Users can now omit `group_name` at allocation time and specify it later at rendezvous time:

```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device)
symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0")
```

Rationales for this change:
- This allows the same allocation to establish symmetric memory under different groups
- Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529
Approved by: https://github.com/lw
2024-11-17 09:31:17 +00:00
602ae9cbcf Specialize symfloats during equality checks (#140830)
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python
    test/inductor/test_torchinductor_opinfo.py
    TestInductorOpInfoCPU.test_comprehensive_nn_functional_local_response_norm_cpu_float32`
    when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140830
Approved by: https://github.com/ezyang
2024-11-17 06:35:22 +00:00
6094f17ada Revert "revert test repro logging" (#140749)
This reverts commit 6323fa673279eac9f2292b9b7790f621a4649af8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140749
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138634
2024-11-17 06:25:54 +00:00
62fb6fd8bd Fix broken AOTInductor node and kernel counts (#139435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139435
Approved by: https://github.com/desertfire
ghstack dependencies: #139411, #139412

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:17:07 +00:00
83e62cbc18 Enable all fixed cpp_wrapper tests (#139412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139412
Approved by: https://github.com/desertfire
ghstack dependencies: #139411

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:17:07 +00:00
819b0ebd94 cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)
Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use.

Closes #135559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411
Approved by: https://github.com/desertfire

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:16:59 +00:00
2fc692b3dd [audio hash update] update the pinned audio hash (#140860)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140860
Approved by: https://github.com/pytorchbot
2024-11-17 03:34:54 +00:00
c1f21bf2b6 Made FlexAttention error on subgraph lowering failure (#140331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140331
Approved by: https://github.com/drisspg
2024-11-17 02:43:58 +00:00
b86b5349cb Ignore eager profiling code in training IR (#140826)
Differential Revision: [D66010452](https://our.internmc.facebook.com/intern/diff/D66010452/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140826
Approved by: https://github.com/zhxchen17
2024-11-16 20:31:17 +00:00
bf8709b08a Revert "[C10D] call destroy_process_group after MultiProcess tests (#140820)"
This reverts commit 77d1f076dadec7a77c4bcf807c4efbef6ca5a8f1.

Reverted https://github.com/pytorch/pytorch/pull/140820 on behalf of https://github.com/wconstab due to failures on trunk not on PR CI ([comment](https://github.com/pytorch/pytorch/pull/140820#issuecomment-2480644227))
2024-11-16 16:32:14 +00:00
ce77409647 Upgrade to fbscribelogger 0.1.7 (#138634)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138634
Approved by: https://github.com/huydhn
2024-11-16 14:33:34 +00:00
77d1f076da [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
2024-11-16 14:24:52 +00:00
f8891a764d [C10D] dedup send/recv impls (#140815)
Avoid copypaste of send/isend and recv/irecv impl.

This does change the warning issued from send to include the identifier
"isend" instead of "send", but I think thats not a big deal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140815
Approved by: https://github.com/fegin
ghstack dependencies: #140460
2024-11-16 14:24:52 +00:00
3d4e68fad3 [C10D] Support group_dst/group_src in c10d send/recv (#140460)
Partly addressing RFC 0042 (https://github.com/pytorch/rfcs/pull/71)

It's annoying that 'dst' (for send) must be a global rank even when a
group is passed in.  But we can't easily change 'dst' without breaking
existing cases.

Furthermore, requiring use of 'global' dst breaks the less common usage
pattern of creating a new ProcessGroup object that is not connected to
the 'default group' and thus has no logical 'global' ranks.
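A hedged sketch of the addressing mode this stack adds, assuming the new `group_dst`/`group_src` keywords and an already-initialized job with at least 4 ranks:

```python
import torch
import torch.distributed as dist

# Ranks 2 and 3 form a subgroup; with group_dst/group_src they can address each
# other by their rank *within* the subgroup instead of by global rank.
subgroup = dist.new_group(ranks=[2, 3])
t = torch.ones(4)
if dist.get_rank() == 2:
    dist.send(t, group=subgroup, group_dst=1)   # global rank 3
elif dist.get_rank() == 3:
    dist.recv(t, group=subgroup, group_src=0)   # global rank 2
```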
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140460
Approved by: https://github.com/d4l3k, https://github.com/kwen2501, https://github.com/fduwjj
2024-11-16 14:24:45 +00:00
2b39a8db77 Refactor UnflattenedModule's adapt flat args (#140840)
Test Plan: unblocks model launch

Differential Revision: D66014709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140840
Approved by: https://github.com/pianpwk
2024-11-16 05:09:37 +00:00
0f9eea1329 [FlexAttention] Fix multiple calls to flex bug (#140761)
# Summary
Fixes long-standing bug we've had in the backward pass for flex attention. See https://github.com/pytorch/pytorch/issues/135161 for details

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140761
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-11-16 04:57:04 +00:00
a173186566 [RFC] Implement caching for user defined triton kernels (#140326)
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.

One HUGE hack we do here is that a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_
metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use a regex to remove the `kernel_idx = 0, constant_args_idx = 1` parts, as they are not relevant to the cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative, but this is a longer project.

Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
2024-11-16 02:37:16 +00:00
48a55b8623 [c10d][fr] wait counter for dump function (#140823)
Summary:
Add a wait counter for the dump function.
This is useful to see if we get stuck in the dump function and never return for a particular job.

Test Plan: Tested locally I and see `pytorch.wait_counter.NCCLTraceBuffer__dump.busy_time_us.sum.60` in ODS.

Differential Revision: D65823433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140823
Approved by: https://github.com/fduwjj
2024-11-16 02:22:08 +00:00
be90d3ce86 [IG] Avoid generation of empty merge cpu submodule by splitter v2 (#140794)
Summary:
Customize splitter behavior to mark `get_attr` nodes as acc supported.
Currently these nodes are excluded by `FxNetAccNodesFinder` which marks all nodes with op not in `CALLABLE_NODE_OPS` ("call_module", "call_function", "call_method") as unsupported.

Before this change, merge-net is split into an almost empty cpu submodule with a single empty output node:
```
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:###### debug_print nodes for _run_on_cpu_0
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:Found output node: n.name='output', n.target='output', n.args=((),), n.kwargs={}, n.meta={}
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:return ()
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:
_run_on_cpu_0 stats for merge:
[output] output: 1
```
full log: P1678727348 (generated using same command as below)

Test Plan:
Tested by lowering `ig_organic_feed_cn_v2_mtml` using cmd:
```
buck run mode/opt-split-dwarf //tgif/cli:cli -- --model-name=ig_organic_feed_cn_v2_mtml --model-type ig_organic_feed_cn_v2_mtml --world-size=1 --storage-mode 1 --inference-dtype=FP16 --meta-transform=False --use-random-weights=True --accelerator-arch=3 --enable-input-dist=True --embedding-tables-dtype=FP16 --mtia-use-torch-export=True embedding-quantization-pass torchrec-sharding-pass tgif-split-pass gen-app-graph-pass tgif-mtia-lowering-pass dense-quantization-pass save-torch-package-pass generate-model-package-pass pack-weights-and-save-pass 2>&1 | tee /tmp/publish_ig_organic_feed_cn_v2_mtml_mtia_export_20241114_splitter_2.log
```
Output shows only 1 acc submodule is generated for merge:
```
INFO 18:33:15.951 1735650 utils.py:235: [TGIF] num of acc submodules: 1
INFO 18:33:15.952 1735650 utils.py:236: [TGIF] num of cpu submodules: 0
INFO 18:33:16.534 1735650 logging_utils.py:53: [TGIF] _run_on_acc_0 graph module debug info: https://www.internalfb.com/intern/everpaste/?color=0&handle=GK4VKhWsDKF9VdsDAKxhR6KAlhJ0br0LAAAz
INFO 18:33:16.534 1735650 utils.py:257: [TGIF] Start MTIA lowering _run_on_acc_0 in merge, device ordinal: -1
```
full log: P1679596796

Differential Revision: D65983916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140794
Approved by: https://github.com/ezyang
2024-11-16 01:49:03 +00:00
bf78a0fa96 Add dim to logging to help debug (#140445)
Differential Revision: D65839759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140445
Approved by: https://github.com/ljyuva83, https://github.com/ColinPeppler
2024-11-16 01:33:29 +00:00
5df9207ba9 Don't go through dispatch for *_dot_with_fp32_arith (#140834)
We don't need to dispatch for these because they're only used from within ATen/native/cpu, which is rebuilt per-CPU_CAPABILITY anyway.

Differential Revision: [D66012283](https://our.internmc.facebook.com/intern/diff/D66012283/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140834
Approved by: https://github.com/malfet
2024-11-16 00:30:25 +00:00
baf756a785 [reland] [aoti] Selectively package AOTI generated files (#140675)
Summary: Reland  https://github.com/pytorch/pytorch/pull/140022

Test Plan: CI

Differential Revision: D65929964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140675
Approved by: https://github.com/desertfire
2024-11-15 23:48:34 +00:00
109f8274a8 Revert "Add NHWC support for group normalization (#126635)"
This reverts commit ed0e63e938317fd254a705f00580caeb68768f9c.

Reverted https://github.com/pytorch/pytorch/pull/126635 on behalf of https://github.com/kit1980 due to Reverted internally at Meta, see D65979564 ([comment](https://github.com/pytorch/pytorch/pull/126635#issuecomment-2480130943))
2024-11-15 23:38:15 +00:00
0aed13437e remove typo in UninitializedParameter docstring (#140197)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140197
Approved by: https://github.com/Skylion007
2024-11-15 23:26:23 +00:00
41bb1539d3 Fix get_unsafe_globals_in_checkpoint to account for user allowed globals per docstring (#140738)
bugfix: this function did not account for the user allowed globals :(
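A hedged sketch of the intended behavior (the checkpoint path and class are hypothetical): once a global is allowlisted via `add_safe_globals`, it should no longer be reported by this function:

```python
import torch

class MyConfig:  # hypothetical user class that was pickled into the checkpoint
    pass

torch.serialization.add_safe_globals([MyConfig])
# After this fix, user-allowlisted globals are excluded from the returned list.
unsafe = torch.serialization.get_unsafe_globals_in_checkpoint("checkpoint.pt")
print(unsafe)
```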

Differential Revision: [D65960696](https://our.internmc.facebook.com/intern/diff/D65960696)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140738
Approved by: https://github.com/malfet
2024-11-15 22:47:35 +00:00
fc813df120 Benchmarks dynamo update script to use ClickHouse instead of Rockset (#140574)
Query works but the part where it parses the job name is broken

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140574
Approved by: https://github.com/huydhn
2024-11-15 22:17:35 +00:00
d64827dc35 [ROCm][Inductor][CK] Enable scaled mm with bias in gemm max autotune with CK backend (#140674)
## Testing
```
pytest test/inductor/test_ck_backend.py -k scaled_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140674
Approved by: https://github.com/chenyang78
2024-11-15 22:08:38 +00:00
ffd5197138 Ensure index for state guard construction is a source (#140515)
Fixes https://github.com/pytorch/pytorch/issues/140393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140515
Approved by: https://github.com/anijain2305, https://github.com/vmoens
2024-11-15 22:02:50 +00:00
1fd4757fdc Support tensor betas in Adam and AdamW (#134171)
Adds support for beta1 and beta2 to be wrapped in tensor for Adam and AdamW.

Fixes https://github.com/pytorch/pytorch/issues/133898
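A short sketch of the new capability (the model and hyperparameters are illustrative):

```python
import torch

model = torch.nn.Linear(8, 8)
# betas passed as scalar tensors instead of Python floats
betas = (torch.tensor(0.9), torch.tensor(0.999))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=betas)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
opt.step()
```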

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134171
Approved by: https://github.com/janeyx99
2024-11-15 21:55:55 +00:00
924c1fe3f3 [CP] Enable CP + compiler tests when there are more than 2 GPUs (#133736)
https://github.com/pytorch/pytorch/pull/132755 makes c10d_functional.wait_tensor an effectful ORDERED op, which should resolve any issues due to dangling waits for CP ring attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133736
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-11-15 20:42:51 +00:00
476e0697f5 Fix for split gates enabled quantizable LSTM subclass (#140818)
Summary:
### Motivation
In D65283170, we need a subclass of quantizable LSTM to enable split_gates. This is also required for tests.

### What's the change?
As the subclass is not part of the no_observer() set, an improper observer is added after the quantizable LSTM module. Here, we change the class check to an issubclass check on the no_observer set.

Test Plan:
- N6206576
- CI.

Reviewed By: andrewor14

Differential Revision: D65989314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140818
Approved by: https://github.com/andrewor14
2024-11-15 20:15:52 +00:00
03b7ec9237 Revert "create a new torch.cuda.memory_usage_in_bytes api (#140719)"
This reverts commit 9febc476372e25f65cfcd642bf49625db10f0f0b.

Reverted https://github.com/pytorch/pytorch/pull/140719 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is flaky on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140719#issuecomment-2479832082))
2024-11-15 20:05:32 +00:00
210de39872 Revert "[FlexAttention] Fix multiple calls to flex bug (#140761)"
This reverts commit b506d1cc8aee0d17cb72c2be0bc03361d4023698.

Reverted https://github.com/pytorch/pytorch/pull/140761 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140761#issuecomment-2479819212))
2024-11-15 19:58:37 +00:00
47f44303ff Add ciflow/inductor automatically in more cases (#140824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140824
Approved by: https://github.com/malfet
2024-11-15 19:54:20 +00:00
80d63e7dd9 Fix softmax_backward_data cpu implementation error when argument output is noncontinguous (#139740)
The implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.

Here is a test case that demonstrates this issue:

```python
import torch

torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```

In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` produces the same results regardless of whether `output` is contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
2024-11-15 19:53:20 +00:00
9602f56979 Fix misuse of offset param in seek (#140633)
Fixes #115630.

The size of BufferAdapter was calculated incorrectly due to misuse of the Python `seek` method, causing the miniz reader to be initialized with the wrong size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140633
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
2024-11-15 19:07:52 +00:00
500ce29e4c Use has_free_unbacked_symbols instead of bool(free_unbacked_symbols) (#140027)
With 20K features this saves ~20 seconds: 257.021589517593 -> 237.8304626941681.
buck2 run @fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140027
Approved by: https://github.com/ezyang
2024-11-15 19:01:06 +00:00
4caf6a1fc8 [ROCm] Bug fix for flex attention configs avoiding ROCm path (#140270)
Fixes https://github.com/pytorch/pytorch/issues/139755 https://github.com/pytorch/pytorch/issues/139621

Follow-up fix to https://github.com/pytorch/pytorch/pull/139883, which made the bulk of the required changes, but a logic error resulted in ROCm still using H100 configurations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140270
Approved by: https://github.com/bertmaher
2024-11-15 17:52:56 +00:00
8e1f96469b [dynamo] Remove the name_stack code paths in symbolic_convert.py (#140155)
This is no longer needed now that we've replaced `ClosureVariable` with
`NewCellVariable`, i.e., Dynamo now treats `LOAD_CLOSURE` the same as
`LOAD_FAST`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140155
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #140330, #140152, #140436, #140435, #140153, #140154
2024-11-15 17:17:30 +00:00
54dde12c37 [dynamo] Remove closure_cells and merge/remove code paths (#140154)
Now that all cells are modeled as `NewCellVariable` in Dynamo, we no
longer need to put cell variables into this special `closure_cells`,
rather we just merge `closure_cells` with `symbolic_locals`.

This allows us to merge and remove some code paths, notably make
`LOAD_CLOSURE` the same as `LOAD_FAST`, and `LOAD_DEREF` & `STORE_DEREF`
the same for inlining or regular `InstructionTranslator`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140154
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435, #140153
2024-11-15 17:17:30 +00:00
ea1d11cf74 [dynamo] Represent all cells as NewCellVariable (#140153)
In addition to `NewCellVariable`, Dynamo has 3 ways of modeling cell objects:
1. For cells captured and created by the root frame, represent them as
   their contents in `root_tx.symbolic_locals`, which `LOAD_DEREF` and
   `STORE_DEREF` update directly, without going through `SideEffects`.
2. `ClosureVariable`: this is created when cells from (1) are captured
   by a newly created function Dynamo is about to inline. It's a handle
   with a name that redirects `LOAD_DEREF` and `STORE_DEREF` back (1),
   to make `root_tx.symbolic_locals` up-to-date.
3. For cells that are captured by both the root frame and some
   pre-existing function Dynamo is about to inline, represent those
   cells as contents, and do not allow writes to them.

Note that (2) and (3) are mainly to conform with (1) -- to make sure
Dynamo has a consistent modeling of cells for the same cell objects.

In this patch, we represent all of these cells as `NewCellVariable`. The
main new code paths introduced are:
- using `NewCellVariable` to model cell objects created by the root
  frame (the cells are passed in as input to `InstructionTranslator`),
  this is what allows us to get rid of all 3 legacy paths above.
- adding a new `AutoDerefLocalSource` to deal with the python-code
  level (guards) and bytecode level (codegen) auto-dereferencing
  behavior, when accessing pre-existing python cells. This also
  involves a tiny update to guard manager generation.
- plumbing some extra info into `LocalSource` and `CellVariable` so that
  we can still emit `LOAD_DEREF`, `STORE_DEREF`, `LOAD_CLOSURE` (instead
  of `make_cell`, `cell_contents` attribute access, and `LOAD_FAST`),
  which is important for readability, performance, and some
  assumptions `bytecode_transformation.py` makes.

As a result, this patch removes a lot of the now-dead code paths and
TODOs. Notably, it significantly simplified the `prune_dead_locals`
function, which was duplicating a lot of the logic from
`prune_dead_object_new`; this conveniently closes #137123.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140153
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435
2024-11-15 17:17:30 +00:00
7faee6bf15 [dynamo] Track from registered tensor hooks in prune_dead_object_new (#140435)
Registered tensor hooks contain `NestedUserFunctionVariable`, which might
capture a `NewCellVariable` for cell objects created during Dynamo
tracing, so we must make sure it doesn't get pruned away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140435
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #140330, #140152, #140436
2024-11-15 17:17:30 +00:00
ac6684ebbc [dynamo] Identify pre-existing captured cells by cell id rather than content id (#140436)
In `match_nested_cell`, Dynamo tried to identify pre-existing captured
cells by `(cell_name, id(cell_contents))`. This works in most cases, but
as the test added in this patch shows, it's not a complete solution.
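
A small, plain-Python illustration of why keying on contents is ambiguous (not code from this patch): two distinct cells can hold equal, and even identical, contents.

```python
def outer(x):
    def inner():
        return x
    return inner

f1, f2 = outer(42), outer(42)
c1, c2 = f1.__closure__[0], f2.__closure__[0]
print(c1 is c2)                                      # False - two distinct cell objects
print(c1.cell_contents == c2.cell_contents)          # True  - equal contents
print(id(c1.cell_contents) == id(c2.cell_contents))  # True here (small ints are interned),
                                                     # so (name, id(contents)) cannot tell the cells apart
```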

This patch
1. changes `match_nested_cell` to `lookup_variable_for_captured_cell`,
   and does the lookup based on id of cell objects, not their contents.
   This requires plumbing a tuple of captured cell objects from
   different CPython versions all the way to
   `InstructionTranslator.__init__`, where we store a mapping from the
   ids of these cell objects, and use it later in
   `UserFunctionVariable.bind_args` to look for these unboxed cells.
2. builds off (1) -- rather than using a `VariableTracker` that
   represents the content of the unboxed cells, use `ClosureVariable`,
   which enables codegen in case these cells escape as closure of a
   `NestedUserFunctionVariable`.

The patch adds a regression test for each of the scenarios above:
1. `test_write_to_cells_with_name_shadowing` where Dynamo mistakenly
   thought the program is writing to a cell captured by root frame (which
   it doesn't support atm), which resulted in
```
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/symbolic_convert.py", line 3340, in STORE_DEREF
    unimplemented("write to __closure__ while inlining")
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented
    raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: write to __closure__ while inlining
```
2. `test_existing_func_that_creates_capturing_nested_func` where Dynamo
   ended up trying to codegen a `NestedUserFunctionVariable` that
   captures a cell which was also captured by the root frame, so it was
   unboxed and ends up emitting `LOAD_DEREF` rather than
   `LOAD_FAST/LOAD_CLOSURE` during codegen, resulting in
```
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/variables/functions.py", line 105, in _create_nested_fn
    func = FunctionType(code, f_globals, name, defaults, closure)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: arg 5 (closure) expected cell, found int
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140436
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #140330, #140152
2024-11-15 17:17:30 +00:00
a4032d8396 [dynamo] Use ExecutionRecorder only in root frame InstructionTranslator (#140152)
As title. This is effectively what ended up happening anyways since we
always overwrite the record with the current frame's while propagating
the exception upward in `InstructionTranslatorBase.run`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140152
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #140330
2024-11-15 17:17:30 +00:00
85dd7b84cf [dynamo] Add a DynamoFrameType type above Python frame object (#140330)
This patch introduces a `DynamoFrameType` to serve as a layer between
Dynamo and different versions of Python frame object. In
`DynamoFrameType`, we only register attributes Dynamo cares about (e.g.,
`f_code`, `f_locals`, etc.).

This will be helpful when it comes to adding new attributes to this
`DynamoFrameType`, or dealing with Python version changes.
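
A rough sketch of the adapter idea (simplified and assumed; attributes beyond `f_code`/`f_locals` are illustrative):

```python
class DynamoFrameType:
    """Thin wrapper exposing only the frame attributes Dynamo needs."""

    def __init__(self, frame):
        self._frame = frame

    @property
    def f_code(self):
        return self._frame.f_code

    @property
    def f_locals(self):
        return self._frame.f_locals

    # further properties (f_globals, f_lasti, ...) would be added here as needed,
    # giving a single place to absorb Python-version differences in the raw frame object
```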

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140330
Approved by: https://github.com/jansel, https://github.com/williamwen42
2024-11-15 17:17:30 +00:00
c05eff278a [BE][Ez]: Update ruff to 0.7.4 (#140806)
Updates ruff to 0.7.4, mainly updates false pos/negatives for rules and fixes some bad autofixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140806
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-11-15 17:04:32 +00:00
de34f581f1 Revert "Made FlexAttention error on subgraph lowering failure (#140331)"
This reverts commit e68bc76c28934561e336f0fba8ef71bcea401701.

Reverted https://github.com/pytorch/pytorch/pull/140331 on behalf of https://github.com/malfet due to Looks like it regressed trunk, see 55f1959fc1/1 ([comment](https://github.com/pytorch/pytorch/pull/140331#issuecomment-2479435705))
2024-11-15 17:00:21 +00:00
cyy
55f1959fc1 [12/N] Fix extra warnings brought by clang-tidy-17 (#140801)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140801
Approved by: https://github.com/Skylion007
2024-11-15 16:54:30 +00:00
e2e67a010a [logging] Add dynamo_compile fields for pre-dispatch/joint/post-dispatch times (#140306)
Tested internally: P1679622670

Differential Revision: [D65986059](https://our.internmc.facebook.com/intern/diff/D65986059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140306
Approved by: https://github.com/ezyang
2024-11-15 15:02:08 +00:00
cyy
1b95ca904f [9/N] Fix Wextra-semi warning (#140803)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140803
Approved by: https://github.com/lw
2024-11-15 14:01:43 +00:00
25d9be37be Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink SHARP (NVLS) reductions by allocating special memory using MemPool and registering it with the NCCL buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-15 12:47:49 +00:00
ae7f809bfc Update torch-xpu-ops commit pin (#140782)
Update the torch-xpu-ops commit to [bf4bab1](bf4bab1fff), which includes:

- Fix Werror=terminate relevant building issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140782
Approved by: https://github.com/EikanWang
2024-11-15 10:10:52 +00:00
ee3a4f068c [FSDP2] privateuse1 support fsdp2. (#139539)
We are looking forward to supporting FSDP2 with devices other than CUDA. Please give me some coding suggestions. Thank you very much.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139539
Approved by: https://github.com/kwen2501
2024-11-15 06:34:35 +00:00
b506d1cc8a [FlexAttention] Fix multiple calls to flex bug (#140761)
# Summary
Fixes long-standing bug we've had in the backward pass for flex attention. See https://github.com/pytorch/pytorch/issues/135161 for details

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140761
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-11-15 06:28:20 +00:00
9febc47637 create a new torch.cuda.memory_usage_in_bytes api (#140719)
Summary:
the current torch.cuda.memory_usage returns memory utilization; more specifically, for NVIDIA GPUs it is the percent of time over the past sample period during which global memory was being read or written.

see more details in https://github.com/pytorch/pytorch/issues/140638
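
For contrast, a hedged usage sketch (the new function's name is taken from the PR title; its exact signature is assumed, not confirmed here):

```python
import torch

util_percent = torch.cuda.memory_usage(0)          # existing API: utilization in percent,
                                                    # sampled over a time window (NVIDIA)
used_bytes = torch.cuda.memory_usage_in_bytes(0)    # proposed API: absolute memory usage in bytes
print(util_percent, used_bytes)
```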

Test Plan: added a new unittest

Differential Revision: D65928031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140719
Approved by: https://github.com/xw285cornell, https://github.com/hongxiayang
2024-11-15 05:59:40 +00:00
6c0a2d8bbf Fix the check for can_use_expanded_index_path (#140351)
Fixes #129093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140351
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2024-11-15 05:52:23 +00:00
8043e67026 catch tensor.numel() == 0 in nan detector (#140741)
Context: we are now trying to pass an empty tensor through the system (sometimes... it's an edge case), and it seems to cause all_reduce to segfault, which is unexpected to me

Deep Shah and Pavan identified the issue, I'm just pushing for a fix :)
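
A minimal sketch of the kind of guard being added (a Python-level illustration under assumptions; the actual fix lives in the C++ NaN detector):

```python
import torch

def check_for_nan(tensor: torch.Tensor) -> None:
    if tensor.numel() == 0:
        return  # nothing to scan; avoid touching the data of an empty tensor
    if torch.isnan(tensor).any():
        raise RuntimeError("NaN detected in collective input")

check_for_nan(torch.empty(0))   # no-op instead of tripping on an empty tensor
check_for_nan(torch.randn(4))   # normal path
```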

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140741
Approved by: https://github.com/shuqiangzhang
2024-11-15 05:03:20 +00:00
865a7c5238 [ONNX] Improve the conversion from dynamic axes to shapes (#140488)
Features:
(1) Add support for tree structure.
(2) Add user warning before axes to shapes conversion
(3) Add suggestion of providing `dynamic_shapes` when conversion fails

Notes:
(1) `input_names` is crucial to the conversion, as we don't know the ONNX graph inputs.
(2) min and max are set to defaults, so LLMs have a higher chance of failing if users use `dynamic_axes`, due to the min/max constraint dependency between `attention_mask` and `sequence_length`, etc. (found in llama-3.2-1B_Instruct)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140488
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-11-15 04:26:45 +00:00
94824766e6 [ONNX] Separate decomp into single step and add to the report (#140767)
1. Fix the ordering of the error report entries so non-strict show on top
2. Isolate run_decomposition into a separate step because it sometimes fails. This makes it easier for users to understand what failed

Fix https://github.com/pytorch/pytorch/issues/140762 Fix https://github.com/pytorch/pytorch/issues/137638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140767
Approved by: https://github.com/titaiwangms
2024-11-15 04:26:16 +00:00
e68bc76c28 Made FlexAttention error on subgraph lowering failure (#140331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140331
Approved by: https://github.com/drisspg
2024-11-15 04:26:01 +00:00
80aa19a622 [PGNCCL] Add an API to get the status/error code of each PG (#140087)
Summary:
If unhealthy, the user should be able to get the type of error, e.g.,
timeout, NCCL error, or remote error.

This API is applied at the PG level, compared to the work.get_future_result() API, which is applied at the Work level.
Error detection at the PG level is much more convenient for users to handle a PG failure as a whole, e.g., by restarting the PG.

Error handling at the work level is still useful for users to attach work-specific context and debug the root cause of the specific failing work/collective.

Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG; the broadcast is currently done through TCPStore.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087
Approved by: https://github.com/kwen2501
2024-11-15 04:11:00 +00:00
9c88b08ac9 [BE] Replace skipIfMPS with expectedFailureMPS (#139940)
Functionally, the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed.
- Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happen only on macOS 13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
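
The distinction above in plain unittest terms (a generic illustration, not PyTorch's own test decorators):

```python
import unittest

class Example(unittest.TestCase):
    @unittest.skip("never runs, so we get no signal when the bug is eventually fixed")
    def test_skipped(self):
        self.assertEqual(1, 2)

    @unittest.expectedFailure
    def test_expected_failure(self):
        # still runs; once the underlying bug is fixed this reports an
        # "unexpected success", prompting removal of the decorator
        self.assertEqual(1, 2)

if __name__ == "__main__":
    unittest.main()
```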
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2024-11-15 03:48:37 +00:00
1c1d06a22c [ROCm] remove size restrictions in gemm_and_bias (#140724)
This aligns hipblaslt behavior with CUDA_VERSION >= 12010.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140724
Approved by: https://github.com/pruthvistony, https://github.com/eqy
2024-11-15 02:23:27 +00:00
baf8686aec [BE][MPS] Remove extra semicolons (#140776)
Fixes following warnings:
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/Generator.cpp:25:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:40:63: warning: extra ';' after member function definition [-Wextra-semi]
   40 |   void set_engine(at::Philox4_32 engine) { engine_ = engine; };
      |                                                               ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:41:46: warning: extra ';' after member function definition [-Wextra-semi]
   41 |   at::Philox4_32 engine() { return engine_; };
      |                                              ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:43:62: warning: extra ';' after member function definition [-Wextra-semi]
   43 |   static DeviceType device_type() { return DeviceType::MPS; };
      |                                                              ^
3 warnings generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140776
Approved by: https://github.com/Skylion007
2024-11-15 01:47:55 +00:00
cec82c3aed Use Manylinux 2.28 for aarch64 CPU workflows (#140743)
Use https://hub.docker.com/r/pytorch/manylinux2_28_aarch64-builder/tags

Similar to https://github.com/pytorch/pytorch/pull/138732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140743
Approved by: https://github.com/malfet
2024-11-15 01:46:29 +00:00
33191bb664 [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating `assignment`, whose size is the same as the number of nodes in the graph. But we can reach the same results by iterating `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/mcr229

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-15 00:25:14 +00:00
14ecbfe184 Add kwen2501 to CODEOWNERS of c10d backend APIs (#140231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140231
Approved by: https://github.com/shuqiangzhang
2024-11-14 23:58:51 +00:00
217d328764 OpenReg: Support autograd (#140662)
Add some unfinished implementations to support autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140662
Approved by: https://github.com/ezyang
2024-11-14 23:47:56 +00:00
02d0c43c32 [SymmetricMemory] fix a bug in symm_mem::memset32_ where the ops fails when offset=0 (#140129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140129
Approved by: https://github.com/lw
ghstack dependencies: #140127, #140128
2024-11-14 23:29:16 +00:00
684db9beb2 [SymmetricMemory] fix a bug where get_signal_pad() returns a tensor backed by a buffer ptr instead of a signal_pad ptr (#140128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140128
Approved by: https://github.com/lw
ghstack dependencies: #140127
2024-11-14 23:29:16 +00:00
c3d61bd367 [SymmetricMemory] allow overlapping devices for testing (#140127)
When `TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES` is set, the check for overlapping devices and multicast support will be disabled. This is useful for testing with a single device.

Making this an env var instead of an API argument since it is likely only useful for testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140127
Approved by: https://github.com/lw
2024-11-14 23:29:16 +00:00
Aki
9c818c880f [torchgen] Improve schema parsing with regex for numeric ranges (#140210)
Replaces the hardcoded string replacement for numeric ranges with a more robust regex pattern that handles any combination of positive and negative numbers in default value ranges.
Fixes #135470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140210
Approved by: https://github.com/ezyang
2024-11-14 23:28:27 +00:00
cyy
e90888a93d [8/N] Fix Wextra-semi warning (#140697)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140697
Approved by: https://github.com/ezyang
2024-11-14 23:08:04 +00:00
05c3330893 use more elements per thread for narrow dtypes (#139449)
Fix perf issue for narrow type by accessing more elements per thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-14 22:50:16 +00:00
7621fc5dad Add missing boundary checks to cunn_SoftMaxForward (#140682)
This fixes OOB memory access for following code
```python
import torch
qk = torch.randn((1024,587), dtype=torch.float64, device='cuda')
smqk = torch.softmax(qk, dim=-1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140682
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-11-14 22:49:06 +00:00
c1fe6be202 Revert "[dynamo] add SymNode bitwise and/or (#138777)"
This reverts commit c98ef0279e6eb968f5f9d22e1f193e7064594152.

Reverted https://github.com/pytorch/pytorch/pull/138777 on behalf of https://github.com/ezyang due to triggering AssertionError: Guard check failed: 14/2: name 'BitwiseFn_bitwise_or' is not defined ([comment](https://github.com/pytorch/pytorch/pull/138777#issuecomment-2477477776))
2024-11-14 21:52:40 +00:00
d751b271b5 Torchbench nightly MPS runs (#135386)
Add a workflow to run TorchBench with nightly mps builds & upload performance data to the HUD

Solves: https://github.com/pytorch/pytorch/issues/115201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135386
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-14 21:50:23 +00:00
f57ef5ddf2 Update Kineto Submodule (#140629)
Summary: Update Submodule from Oct 10, 2024 to Nov 13, 2024

Test Plan: CI Passes

Differential Revision: D65915865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140629
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/briancoutinho
2024-11-14 21:23:59 +00:00
2ea2c89675 Fixes the manylinux_2_28 docker image to build PyTorch on Aarch64 (#137696)
This change provides openblas support to the manylinux_2_28 Docker image.

- It allows us to build pytorch using manylinux_2_28.
- Using this image in PyTorch builds provides major perf improvements when testing torchbench models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137696
Approved by: https://github.com/snadampal, https://github.com/atalman
2024-11-14 21:09:53 +00:00
3424ca378f [Inductor efficiency] Move less critical Inductor jobs to periodic (#140466)
Moves jobs that don't have to be run as frequently to the inductor-periodic workflow, based on the priorities given by @desertfire
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140466
Approved by: https://github.com/huydhn, https://github.com/zxiiro, https://github.com/desertfire
2024-11-14 21:09:06 +00:00
27c7caf745 [ROCm] TunableOp fix for batched MM with views. (#140673)
Fixes #140278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140673
Approved by: https://github.com/jeffdaily
2024-11-14 20:22:12 +00:00
8094b19620 Fix _out_spec (#140608)
Summary: The gm_torch_level can be a _LazyGraphModule(GraphModule) instead of a GraphModule. When we call .recompile(), GraphModule populates the self._out_spec, but _LazyGraphModule(GraphModule).recompile() doesn't populate it.

Test Plan: CI

Differential Revision: D65902135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140608
Approved by: https://github.com/tugsbayasgalan
2024-11-14 20:09:30 +00:00
b0d681417c [MPS] Reintroduce support for convolutions with output_channels > 65536 (#140726)
This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207.

I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from ci for verification. I'm expecting the new test will fail for macOS versions < 15.1, and the old test will start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+.

Fixes #140722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726
Approved by: https://github.com/malfet
2024-11-14 20:09:01 +00:00
cd6ace1d15 [EZ] Delete unused xfailIfMacOS14_4Plus (#140735)
Issue was fixed by https://github.com/pytorch/pytorch/pull/130038 but decorator remained in place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140735
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-11-14 20:08:48 +00:00
65518fd9ef Turn on triton bundler in OSS (#140600)
Its been enabled internally, lets also push it out to OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140600
Approved by: https://github.com/masnesral
2024-11-14 20:02:15 +00:00
c536903c3f revert test repro logging (#140717)
@ezyang noticed this exercises a multithreading bug that is causing tests to become disabled:

```
2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3
2024-11-13T21:05:55.8364857Z
2024-11-13T21:05:55.8364974Z Traceback (most recent call last):
2024-11-13T21:05:55.8365491Z   File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
2024-11-13T21:05:55.8366003Z     self.run()
2024-11-13T21:05:55.8366371Z   File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run
2024-11-13T21:05:55.8366858Z     self._target(*self._args, **self._kwargs)
2024-11-13T21:05:55.8367518Z   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop
2024-11-13T21:05:55.8368189Z     self.loop.run_until_complete(self.task)
2024-11-13T21:05:55.8368774Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2024-11-13T21:05:55.8369348Z     return future.result()
2024-11-13T21:05:55.8369980Z   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker
2024-11-13T21:05:55.8370603Z     message = await asyncio.wait_for(
2024-11-13T21:05:55.8371090Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
2024-11-13T21:05:55.8371573Z     return await fut
2024-11-13T21:05:55.8372156Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get
2024-11-13T21:05:55.8372613Z     await getter
2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop
2024-11-13T21:05:55.8375366Z
2024-11-13T21:05:55.8375603Z   warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717
Approved by: https://github.com/ezyang, https://github.com/zxiiro
2024-11-14 19:51:52 +00:00
f6ba95a76f [inductor] PyCodeCache: only delete on-disk artifacts if purge=True (#140216)
Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, where some code paths (coordinate descent tuning, at least) call `PyCodeCache.load_by_key_path` and expect that the files are still on disk. (But when caches are disabled, we call cache_clear before every inductor compile.) It seems we probably have a shortcoming in the disable-cache logic, but since we also have flaky test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further.

Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), I added a `purge` param to optionally delete files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216
Approved by: https://github.com/eellison
2024-11-14 19:34:57 +00:00
7702da9ce6 ci: Remove --progress-bar fallback for pip (#140189)
All versions of pip that we currently support should have this flag so removing this should essentially be a no-op.

Also put the actual command into a variable so we only have to change it once next time instead of changing it in 3 places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140189
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-11-14 19:26:41 +00:00
222d4b48b1 Revert "cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)"
This reverts commit 761b42bc085190e272a930847694e872d92a1255.

Reverted https://github.com/pytorch/pytorch/pull/139411 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
25048e5381 Revert "Enable all fixed cpp_wrapper tests (#139412)"
This reverts commit fef16fe254da2f9598c6f8bb19fdd883e5a54971.

Reverted https://github.com/pytorch/pytorch/pull/139412 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
14641c0393 Revert "Fix broken AOTInductor node and kernel counts (#139435)"
This reverts commit 8cb0b932a16ee69137287b4e3872ffd39a79a8d4.

Reverted https://github.com/pytorch/pytorch/pull/139435 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
b69282c98c Enable opting out of experiments even when they're being rolled out (#140433)
Enables opting out of specific experiments in the runner determinator

To opt out:
1. Go to the tracking issue: https://github.com/pytorch/test-infra/issues/5132
2. In the entry by your name, enter the experiment name, prefixed with a `-`.  For example, to opt out of the LF fleet you could enter `@ZainRIzvi,-lf`

This lets you simultaneously be opted into some experiments and opted out of others.

While the `disable-runner-experiments` label offers an option to disable all experiments on a given PR, this one lets you disable a selected set of experiments across all your PRs.

Fixes https://github.com/pytorch/pytorch/issues/138099

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140433
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-11-14 19:18:24 +00:00
b11ff3cf60 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
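
A heavily simplified sketch of the shape of this design (names and structure assumed; not the actual implementation):

```python
import contextlib
import time

class MetricsContext:
    def __init__(self, on_exit):
        self._on_exit = on_exit
        self._depth = 0
        self._metrics = {}

    def __enter__(self):
        self._depth += 1
        return self

    def __exit__(self, *exc):
        self._depth -= 1
        if self._depth == 0:               # only the outermost exit records/logs
            self._on_exit(self._metrics)
            self._metrics = {}

    def increment(self, field, value):
        assert self._depth > 0, "dynamo_timed used outside of MetricsContext"
        self._metrics[field] = self._metrics.get(field, 0) + value

METRICS = MetricsContext(on_exit=print)    # stand-in for recording CompilationMetrics

@contextlib.contextmanager
def dynamo_timed(field):
    start = time.time()
    try:
        yield
    finally:
        METRICS.increment(field, time.time() - start)

with METRICS:                              # one compile -> one record emitted on exit
    with dynamo_timed("inductor_compile_time_s"):
        time.sleep(0.01)
```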

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
2024-11-14 19:11:20 +00:00
ea7d1826a2 [ez] Make merge blocking sevs be based on label instead of string (#140636)
sev issues are now merge blocking if they are labeled merge blocking, instead of simply having the merge blocking string in the body.  This makes it easier to default to non merge blocking when creating a sev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140636
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-11-14 19:02:27 +00:00
sdp
83b6d91d08 [Intel GPU] Add NestedTensorXPU to parseDispatchKey and codegen (#140461)
Add `NestedTensorXPU` dispatch key.
```
>>> nt = torch.nested.nested_tensor([]).to("xpu")
>>> nt
nested_tensor([

], device='xpu:0')
>>> nt.is_xpu
True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140461
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-11-14 18:54:41 +00:00
9ff368c270 [pytorch] Add logger for pt2 compile chromium events to hive (#139941)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2535

Logging raw chromium events to hive per job run enables us to build combined rank perfetto traces without having to depend on Logarithm and deal with things like rate limits etc.

We can easily build a utility to query hive and upload traces to manifold and view them on perfetto

Test Plan:
Launch a job

```
buck2 run mode/opt //aps_models/examples/dlrm:dlrm_train_app -- --config-name train_mast_fsdp_torchdynamo launcher.data_project=apf_ai_infra launcher.fbl_entitlement=ai_infra_training_rnd_tc  launcher.hardware=TC_ANY_80G
```

Local run
```
Perfetto: ['https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https://interncache-all.fbcdn.net/manifold/pt2_compile_traces_test/tree/pt2_trace_files/aps-ppanchalia-426838c277/0/0/2bc9975d-921c-4766-9cb2-e7ce9833ae96.json']
```

{F1954710538}

Differential Revision: D65525513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139941
Approved by: https://github.com/jamesjwu
2024-11-14 18:27:38 +00:00
50ab68fa22 [EZ] Make lintrunner usable with Python-3.12 and 3.13 (#140721)
By installing numpy-2.1, as 1.26 is only available up to Python-3.11, and restricting the TorchFix requirement to Python older than 3.13, as TorchFix depends on libcst-1.2 and therefore cannot be installed on 3.13; see https://github.com/pytorch-labs/torchfix/issues/84

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140721
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/ZainRizvi
2024-11-14 17:52:05 +00:00
879e273601 fix: Add type annotation to _record_memory_history (#140545)
Pylance infers the type of the first argument (`enabled`) to `_record_memory_history` as `str` even though the function accepts `Literal[None, "state", "all"]`.

This raises an issue when passing `None`, even though it is a legitimate argument.

This PR addresses the issue by adding the type annotation in the doc string.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140545
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-14 17:44:46 +00:00
adcff4bff0 Revert "use more elements per thread for narrow dtypes (#139449)"
This reverts commit d3fc13a9dd186ceb8d1b56b0968a41686ea645cd.

Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/ngimel due to breaks tests ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2477012582))
2024-11-14 17:28:32 +00:00
f4008a5ce4 [AOTI XPU] Remove workarounds after update torch-xpu-ops that extend c_shim_xpu layer with out-of-tree ATen OPs. (#139026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139026
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-11-14 17:14:58 +00:00
add6bb2e96 [aps] skip version check for export IR. (#140573)
Summary: mitigating potential export compatibility issue for production (temporarily).

Test Plan: CI

Differential Revision: D65890958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140573
Approved by: https://github.com/desertfire
2024-11-14 17:13:42 +00:00
dcf22fa58c [AOTI][refactor] Add sizes and strides util functions (#140449)
Summary: Similar to https://github.com/pytorch/pytorch/pull/139895, add sizes and strides methods to RAIIAtenTensorHandle and ConstantHandle, to increase the code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140449
Approved by: https://github.com/chenyang78
ghstack dependencies: #140447, #140448
2024-11-14 16:48:43 +00:00
3ef2dfc1ba [export] Implement cpp deserializer. (#136398)
Differential Revision: D63206258

This diff introduces a mechanism to generate a json-compatible deserializer in cpp using nlohmann json (already being used by AOTI).

Why do we need this? Because there will be a lot of cases where people don't want to use Python to load the graph (e.g. cpp runtime), and instead they can use this header to deserialize the JSON graph.

Every time we call update_schema.py to update the schema, the header will be auto generated and included into the source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136398
Approved by: https://github.com/angelayi
2024-11-14 16:34:59 +00:00
f98c601efe Avoid logging zeros (#139968)
Summary: title

Test Plan: NA

Differential Revision: D65582953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139968
Approved by: https://github.com/zou3519
2024-11-14 15:46:49 +00:00
216b6a952c triangular_solve: fix meta function output argument dtype check. (#140286)
Tracking issue: #138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140286
Approved by: https://github.com/ezyang
ghstack dependencies: #140186
2024-11-14 15:25:14 +00:00
72c6d13cea [BE]: Use proper logger in torch.distributed.run (#140547)
`torch.distributed.run` was improperly using the root logger and ignoring all logging settings and useful debugging info. Now properly uses the correct logger. Will be added to ruff as part of LOG015 soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140547
Approved by: https://github.com/XuehaiPan, https://github.com/fegin
2024-11-14 14:49:17 +00:00
1c669e7c4e Document the parameter (hx) that RNN actually uses (#140575)
Fixes https://github.com/pytorch/pytorch/issues/136925

This PR updates the docs to use `hx`, which is the parameter actually used by `RNN`:

629c243c82/torch/nn/modules/rnn.py (L650)
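
A short usage example with the documented parameter (standard `torch.nn.RNN` usage; shapes chosen for illustration):

```python
import torch

rnn = torch.nn.RNN(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(4, 10, 8)      # (batch, seq, feature)
hx = torch.zeros(1, 4, 16)     # (num_layers, batch, hidden_size)
output, h_n = rnn(x, hx)       # the optional second argument is named hx
```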
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140575
Approved by: https://github.com/ezyang
2024-11-14 14:45:17 +00:00
ebeab262d9 Refine XPU device prop and fix typo (#140661)
# Motivation
`architecture` is an experimental attribute that might be used by triton AOT codegen. It should not be in `__repr__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140661
Approved by: https://github.com/EikanWang
2024-11-14 11:18:01 +00:00
9a051f6ee0 OpenReg: Fix issue when creating empty tensor (#140496)
On the executor side, tensor creation will fail when it is found that meta.data_ptr is not in the allocated memory, but there is no need to allocate memory when creating an empty tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140496
Approved by: https://github.com/ezyang
2024-11-14 11:10:37 +00:00
aaefa48441 reduce the threshold to change existing data suggestion to noise/3 (#140623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140623
Approved by: https://github.com/bobrenjc93
2024-11-14 06:29:25 +00:00
62eea62493 [Quant][Onednn] add linear_dynamic_fp16 ops (#140376)
**About this PR**
This PR adds the following ops for `linear_dynamic_fp16` in the onednn namespace. These ops are intended for PT2E quantization eager mode.
- `onednn::linear_prepack_fp16`: packs fp32 weight to an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and computes linear in fp32
- `onednn::linear_relu_dynamic_fp16`: similar to the former, but applies relu to the output.

**Test plan**
`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`

**Implementation**
These ops call the oneDNN lib under the hood. It's worth noting that oneDNN does not support f32 * f16 -> f32 computation, so we have to convert the fp16 weight to fp32 before computation. The weight is still in plain format after packing.

**Correctness and performance**
Correctness is guaranteed by unit tests.
Performance of the new ops may be better than the FBGEMM implementation when the weight shape is small, but worse when the weight shape is large, because weight dtype conversion and computation are not fused.
For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different core counts and shapes. When using 1 core per instance, the new implementation is generally faster for weight shapes < 1024 * 1024. When using more cores, the threshold increases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-11-14 05:19:18 +00:00
99c8d5af27 Don't pass credentials explicitly to sccache (#140611)
sccache-0.2.14 can query it through IMDSv1 and sccache-0.8.2 can do it through IMDSv2 (or maybe just use the trust relationship between the host and the bucket).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140611
Approved by: https://github.com/wdvr
2024-11-14 04:44:55 +00:00
e6083016b3 fix test_float_to_int_conversion_nonfinite for NumPy 2 (#138131)
Related to #107302

We saw `test_float_to_int_conversion_nonfinite` fail as we upgraded to NumPy 2.

It is caused by the undefined behavior of `numpy` casting `inf`, `-inf` and `nan` from `np.float32` to other dtypes.
The test is using NumPy as reference for the ground truth. (see line 1013-1015)
However, these behaviors are undefined in NumPy.
If you do `np.array([float("inf")]).astype(np.uint8, casting="safe")`, it results in an error `TypeError: Cannot cast array data from dtype('float64') to dtype('uint8') according to the rule 'safe'`.
The undefined behaviors are always subject to change.

This PR addresses the issue by passing concrete values as the ground-truth references.
In the future, even if NumPy changes its behavior, the test will remain stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138131
Approved by: https://github.com/drisspg
2024-11-14 04:19:19 +00:00
d32eac86f3 Put a compile lock around backward compile (#140626)
Summary: https://fb.workplace.com/groups/1286739428954016/posts/1370274947267130

Test Plan:
```
hg up b5b5adce34
vizard_projects/ml_depth/scripts/run_mld.sh
```

used to crash, no longer crashes

Differential Revision: D65913100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140626
Approved by: https://github.com/ezyang
2024-11-14 04:07:46 +00:00
3ce75e7ea6 [Inductor UT] Fix duplicate registration of custom ops amount test cases (#140540)
Fix #140537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140540
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #140517
2024-11-14 03:36:20 +00:00
8d3a07e321 [Inductor UT] Skip test_decompose_mem_bound_mm.py for XPU since we have not enabled decompose_mem_bound_mm for XPU. (#140517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140517
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-11-14 03:36:20 +00:00
b1d6250028 [ONNX] Use TracedONNXFunction op signature to promote inputs to tensors (#138770)
Prior to this PR, in torchlib TracedONNXFunction, the inputs could be Python constants even if the annotation was set to TensorTypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138770
Approved by: https://github.com/justinchuby
2024-11-14 03:15:07 +00:00
77da0509c4 [executorch hash update] update the pinned executorch hash (#139588)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139588
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-14 02:10:37 +00:00
c6c0554394 [EZ] Delete linux-focal-cuda12_1-py3_10-gcc9-bazel-test (#140659)
Because there is already a `linux-focal-cuda12_1-py3_10-gcc9-bazel-test` job. Not sure what the purpose of testing against 2 CUDA versions is, as only very basic things are tested right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140659
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-11-14 02:00:45 +00:00
80870f62f0 [AOTI][refactor] Switch remaining aoti_torch_get_data_ptr (#140448)
Summary: https://github.com/pytorch/pytorch/pull/139895 added data_ptr(), but there is a remaining place in cpp_wrapper_gpu.py that didn't switch over. Also moved a few AtenTensorHandle-related utility functions from arrayref_tensor.h to utils.h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140448
Approved by: https://github.com/chenyang78
ghstack dependencies: #140447
2024-11-14 01:40:59 +00:00
85deef9ede [AOTI][refactor] Rename generate_extern_kernel_alloc_and_find_schema_if_needed (#140447)
Summary: Rename generate_extern_kernel_alloc_and_find_schema_if_needed to better reflect its meaning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140447
Approved by: https://github.com/chenyang78
2024-11-14 01:40:58 +00:00
e2b7f0bfd2 clarifies the wording in the main README to make it clearer that visu… (#140442)
…al studio build tool is only needed for Windows

I created no issue since the suggested change is actually very small.  This is my very first PR so partly I am creating it just to dip my toes in the water.  In fact I would understand if the change does not get accepted since it's a simple modification to part of the wording in the README.  The wording as it currently stands is probably clear enough for most people, but I still missed the fact that visual studio build tool must only be installed for Windows (even though that is stated there), and I thought by adding some parentheses this might become even more clear, specially since elsewhere in the README the formatting makes it more explicit that some steps must only be run for Windows/Linux/MacOS

As I said, it's a trivial change so I'd understand if it's not accepted, and I am looking forward to making more meaningful contributions as time goes on.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140442
Approved by: https://github.com/soulitzer
2024-11-14 00:35:55 +00:00
70acf02116 Use Manylinux2_28 for wheel builds (#138732)
Fixes https://github.com/pytorch/pytorch/issues/123649
Use Manylinux 2_28 Docker builds for PyTorch Nightly builds

This moves the wheels to a Docker image that uses : ``quay.io/pypa/manylinux_2_28_x86_64`` as a base rather then ``centos:7`` which is EOL on June 30, 2024.

Information:
https://github.com/pypa/manylinux#manylinux_2_28-almalinux-8-based

manylinux_2_28 (AlmaLinux 8 based)
Toolchain: GCC 13
Built wheels are also expected to be compatible with other distros using glibc 2.28 or later, including:
Debian 10+
Ubuntu 18.10+
Fedora 29+
CentOS/RHEL 8+

This migration should enable us to migrate to latest CUDNN version, and land this PR: https://github.com/pytorch/pytorch/pull/137978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138732
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
2024-11-14 00:25:47 +00:00
f85e4338d4 [ONNX] Remove the contiguous patch (#140428)
Remove the contiguous patch because it is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140428
Approved by: https://github.com/titaiwangms
2024-11-14 00:03:17 +00:00
9c75475c77 Add missing pytorch-linux-jammy-py3.12-triton-cpu Docker image (#140571)
When investigating the burst of 429 rate limit failures from docker.io yesterday, I found out that ` pytorch-linux-jammy-py3.12-triton-cpu` hasn't been added to docker build workflow at all.  The bad effect is that the image is rebuilt on every job https://github.com/pytorch/pytorch/actions/runs/11808772774/job/32900628381

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140571
Approved by: https://github.com/seemethere, https://github.com/wdvr
2024-11-13 23:49:31 +00:00
f1e045eb75 Update torch-xpu-ops commit pin (#140277)
Update the torch-xpu-ops commit to [01f4e29](01f4e293fa), which includes:
- Improve XPU operator coverage
- Fix `Werror=comments` relevant building issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140277
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-11-13 23:38:51 +00:00
2f1dbfea02 Logging Refactor - Remove Print Statements (#139782)
Summary:
Removes print statements and implements logging via the logging library.

Hopefully this will allow more control on the level of logging when running models.

Test Plan:
```
AOT_PARTITIONER_DEBUG=1 buck2 run @mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```

Resulting output paste: P1674535630
* Full logs paste: P1674535621

```
pastry P1674535621 | grep "functorch/partitioners.py" | pastry
```

Logging results: P1674549514

Differential Revision: D61678215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139782
Approved by: https://github.com/paryxyt, https://github.com/jansel
2024-11-13 23:09:18 +00:00
b34bb1f562 Add support for parsing torch.Generator in JIT (#140489)
Fixes #140420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140489
Approved by: https://github.com/davidberard98
2024-11-13 23:06:57 +00:00
70060b0927 Add proper parse_tensor_constants support (#140558)
Fixes #140422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140558
Approved by: https://github.com/davidberard98
2024-11-13 23:06:26 +00:00
9d93c27025 Implement unfold_backward on MPS (#135411)
This PR adds native implementation of unfold_backward as metal shader, mostly copy-n-paste of algorithms used in CUDA and CPU implementations, i.e. considering `out = in.unfold(dim, size, step)`, then following holds true:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`
`unfold_backward` Metal kernel  receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`

For each index in `grad_out` find the elements contributing to it and sum them up. Such algorithm requires no synchronization between threads.
That is `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` are equal `out_dim_idx - in_dim_idx * step` . Accumulation step is skipped if `in_last_idx` is outside of [0, size] range.
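
A non-vectorized 1-D reference sketch of that accumulation (plain Python for clarity, not the Metal kernel; variable names follow the description above):

```python
import torch

def unfold_backward_ref_1d(grad_in, in_len, size, step):
    # grad_in has the shape of in.unfold(0, size, step); returns grad_out with shape (in_len,)
    num_windows = grad_in.shape[0]
    grad_out = torch.zeros(in_len, dtype=grad_in.dtype)
    for out_dim_idx in range(in_len):
        lo = max((out_dim_idx - size) // step + 1, 0)
        hi = min(out_dim_idx // step, num_windows - 1)
        for in_dim_idx in range(lo, hi + 1):
            in_last_idx = out_dim_idx - in_dim_idx * step
            if 0 <= in_last_idx < size:
                grad_out[out_dim_idx] += grad_in[in_dim_idx, in_last_idx]
    return grad_out

# quick sanity check against autograd
x = torch.arange(5.0, requires_grad=True)
x.unfold(0, 3, 2).sum().backward()
print(torch.allclose(x.grad, unfold_backward_ref_1d(torch.ones(2, 3), 5, 3, 2)))  # True
```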

This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411
Approved by: https://github.com/manuelcandales

Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>
2024-11-13 23:04:15 +00:00
08acfcddc4 [ez] Fix check labels error when deleting comment (#140578)
Re make of https://github.com/pytorch/pytorch/pull/140587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140578
Approved by: https://github.com/huydhn
2024-11-13 23:00:58 +00:00
274f4cfacb [3/x][fx minimizer] Support all_outputs in minimizer (#139774)
Summary: output nodes may be eliminated into the input nodes if only a subset of the output nodes is specified. Add an option to check results for all output nodes in the partitioned graph.

Test Plan: see D65367305

Reviewed By: qcyuan

Differential Revision: D65367305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139774
Approved by: https://github.com/jfix71
2024-11-13 22:56:42 +00:00
26fde110db Refactor user-defined triton kernel source code collection (#140577)
Differential Revision: [D65895743](https://our.internmc.facebook.com/intern/diff/D65895743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140577
Approved by: https://github.com/zou3519
2024-11-13 22:12:17 +00:00
a8de84998d OpenReg: Export the number of devices (#140492)
Export the number of devices so that it can be used in unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140492
Approved by: https://github.com/ezyang
2024-11-13 22:08:37 +00:00
c1bf714d76 [Profiler] Fix ASAN Overflow Issues (#140441)
Summary:
It seems like this issue is due to leftover CUPTI events from warmup staying persistent in the queue during profiling. These events start before our actual time window and therefore have a timestamp lower than our base time. This makes the delta negative, which results in unsigned overflow. That produces a large number which later has a sign added, creating the signed overflow.

Solution: If a raw timestamp is less than the base timestamp, just mark the process timestamp as -1 so we can mark these events as "to ignore". In Kineto, add a special case to ignore timestamps that are negative.

Test Plan: Test with ASAN

Differential Revision: D65835650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140441
Approved by: https://github.com/davidberard98
2024-11-13 21:30:32 +00:00
ba8568f7fb [c10d][logging] Add wait counter for time spent in object to tensor and tensor to object (#140414)
Originally we wanted to leverage the timer logger to measure the time spent in object-to-tensor and tensor-to-object conversion (https://github.com/pytorch/pytorch/pull/139757), but it got reverted (internally) because of a performance regression. We now use a wait counter instead, which is more lightweight.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140414
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu, https://github.com/wz337
2024-11-13 21:10:43 +00:00
49c124fe1b dynamo: guard on FSDP module parameters (#138819)
Fixes https://github.com/pytorch/pytorch/issues/138715

It looks like we were previously ignoring guards on FSDP module parameters. In the issue linked above, this was causing inductor size/stride asserts to fire. The root cause is that for some code like this:
```
m = FSDP(
    torch.nn.Sequential(
        torch.compile(torch.nn.Linear(1024, 1024)),
        torch.compile(torch.nn.Linear(1024, 4096))
    )
)
```

We need to generate two different graphs for the two linear layers, and it looks like without a `TENSOR_MATCH` guard on the linear parameters, dynamo would think that it could re-use the same graph across both layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138819
Approved by: https://github.com/anijain2305
2024-11-13 20:46:46 +00:00
c8be6f1196 [codemod] Remove unused-variable in pytorch (#140569)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

#buildsonlynotests - Builds are sufficient

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D65833225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140569
Approved by: https://github.com/Skylion007
2024-11-13 20:38:03 +00:00
82597d07aa type annotations for meta_utils (#140203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140203
Approved by: https://github.com/ezyang
2024-11-13 20:07:47 +00:00
c25999bdc0 Revert "Add missing pytorch-linux-jammy-py3.12-triton-cpu Docker image (#140571)"
This reverts commit 51e0996d58e6fa40a8d255a26b767c3f3e035943.

Reverted https://github.com/pytorch/pytorch/pull/140571 on behalf of https://github.com/huydhn due to Not sure why lint fails, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/140571#issuecomment-2474627883))
2024-11-13 19:54:11 +00:00
0f739b8f66 [Codemod] skipIfMps->skipIfMPS (#140562)
As `MPS` is an acronym that stands for Metal Performance Shaders
Also to align more closely with `skipCUDAIf`, not `skipCudaIf`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140562
Approved by: https://github.com/ZainRizvi, https://github.com/r-barnes
2024-11-13 19:45:08 +00:00
f3a6832b09 [inductor] Skip autotuning config on ptxas error (#140495)
Currently, when ptxas errors occur in one of the autotuning configs, we error out. This doesn't match the newly introduced behavior of native Triton ([here](915c149978/python/triton/runtime/autotuner.py (L164))). In this PR, we match Inductor's autotuning behavior to native Triton's by ignoring the ptxas errors and skipping the configs that trigger them.

This unblocks PT2 compilation of an internal model.

Differential Revision: D65861236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140495
Approved by: https://github.com/chenyang78
2024-11-13 19:45:00 +00:00
51e0996d58 Add missing pytorch-linux-jammy-py3.12-triton-cpu Docker image (#140571)
When investigating the burst of 429 rate limit failures from docker.io yesterday, I found out that ` pytorch-linux-jammy-py3.12-triton-cpu` hasn't been added to docker build workflow at all.  The bad effect is that the image is rebuilt on every job https://github.com/pytorch/pytorch/actions/runs/11808772774/job/32900628381

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140571
Approved by: https://github.com/seemethere, https://github.com/wdvr
2024-11-13 19:08:14 +00:00
d63eb3c46c Revert "[logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)"
This reverts commit cb15c1515778499ae801dcf67d55c8bdab4724ef.

Reverted https://github.com/pytorch/pytorch/pull/139849 on behalf of https://github.com/kit1980 due to Breaking an internal tests + there is a bug according to the author ([comment](https://github.com/pytorch/pytorch/pull/139849#issuecomment-2474459094))
2024-11-13 18:47:51 +00:00
42622cf7d5 enable concat linear with mkldnn linear by flag (#139048)
Enable concat linear for CPU mkldnn path.
Previously, we had a concat linear pass in the freezing passes, but it did not work on CPU.
This is because the `concat_linear` pattern ran after `mkldnn_weight_prepack`, and `concat_linear` only handles `addmm`/`mm` etc.

```
addmm -> mkldnn linear
addmm -> mkldnn linear -> cannot concat

# only worked when mkldnn is disabled
addmm ->
addmm -> concat linear
```
Now we changed the `mkldnn linear` related pass numbers to be larger than the `concat_linear` pass number, so concat linear runs first.

```
addmm -> concat linear -> mkldnn linear
addmm ->

```
So it now works correctly with mkldnn linear.

Also, since concat linear does not always have benefits, we add a flag `config.cpp.enable_concat_linear` with a default value of False. Users can enable it as needed, as sketched below.
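
A minimal usage sketch (only the flag name comes from this change; the `torch._inductor.config` path and the freezing requirement are assumptions):

```python
# Hypothetical opt-in sketch: enable the CPU concat-linear pass before compiling.
import torch
import torch._inductor.config as inductor_config

inductor_config.freezing = True                   # concat linear runs in the freezing passes
inductor_config.cpp.enable_concat_linear = True   # opt in; defaults to False

@torch.compile
def f(x, w1, b1, w2, b2):
    # two addmm calls that the pass may concatenate into one larger linear
    return torch.addmm(b1, x, w1) + torch.addmm(b2, x, w2)
```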

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139048
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-13 18:43:37 +00:00
c98ef0279e [dynamo] add SymNode bitwise and/or (#138777)
Fixes [T203472723](https://www.internalfb.com/intern/tasks/?t=203472723)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138777
Approved by: https://github.com/ezyang
2024-11-13 18:31:06 +00:00
22dfb5b6cf [dynamo, 3.13] replace deprecated PyWeakref_GetObject (#140187)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140187
Approved by: https://github.com/jansel
2024-11-13 17:57:28 +00:00
03cccaa76a Doc: Rewrite the storage.rst file to emphasize untyped storages (#140145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140145
Approved by: https://github.com/janeyx99
2024-11-13 17:40:16 +00:00
1a8752bc7d [TorchScript] bindings for torch._C.ClassType.method_names() (#140444)
I used this for debugging, figured I'd upstream it.

This gives you a list of the method names provided by the given ClassType.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140444
Approved by: https://github.com/eellison
2024-11-13 17:23:23 +00:00
2675ef8758 Revert " [Environment Variable][5/N] Use thread-safe getenv functions (#139762)"
This reverts commit 43f0fe60a36dc7e3bd8f77a2451bde81496679b0.

Reverted https://github.com/pytorch/pytorch/pull/139762 on behalf of https://github.com/malfet due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/139762#issuecomment-2474174813))
2024-11-13 16:50:00 +00:00
3d618019fb Fix RMSNorm Notation: Parentheses, Indices, Comma (#140215)
Fixes #140165

* fixed mathematical notation for RMSNorm:
  * changed the RMS function from brackets `[x]` to parentheses `(x)` for consistency and to align with mathematical notation standards for functions
  * added indices (e.g. `y_i`) for element-wise operations, for correctness in the context of tensor operations
  * added comma `,` before $$\text{where}$$

![grafik](https://github.com/user-attachments/assets/47368625-d97a-43de-8b90-17b2c01cbe2f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140215
Approved by: https://github.com/mikaylagawarecki
2024-11-13 15:33:50 +00:00
a58a565819 Revert "[Environment Variable][6/N] Use thread-safe getenv functions (#140200)"
This reverts commit 7d4f5f7508d3166af58fdcca8ff01a5b426af067.

Reverted https://github.com/pytorch/pytorch/pull/140200 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140200#issuecomment-2473956859))
2024-11-13 15:33:23 +00:00
5dc6b8c19e Revert "Allow NJT by default for weights_only torch.load (#140304)"
This reverts commit 1f28235ee2984dbad45b55aa65358b59a7aeea33.

Reverted https://github.com/pytorch/pytorch/pull/140304 on behalf of https://github.com/mikaylagawarecki due to Breaking internal tests due to missing torch.nested._internal ([comment](https://github.com/pytorch/pytorch/pull/140304#issuecomment-2473928461))
2024-11-13 15:24:00 +00:00
b4cc5d38b4 Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit ba136a78ba613d3c7f5d2de53b9fff556e04cfba.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2473814720))
2024-11-13 14:43:15 +00:00
a8a1e58e24 [inductor] Log how compile_threads is set (#139771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139771
Approved by: https://github.com/eellison
2024-11-13 14:17:10 +00:00
c6a29fc3d8 Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 82eb09aafd7e4ee6e4fb0580f2221ea6253d218b.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2473709760))
2024-11-13 14:06:52 +00:00
4a18e26ff5 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit a3cff4bbd4130d36b188dbe101a790e6d7da644f.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2473709246))
2024-11-13 14:05:01 +00:00
34743d8a16 Support dlpack for privateuse1 (#135331)
Fixes #129652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135331
Approved by: https://github.com/shink, https://github.com/FFFrog, https://github.com/ezyang

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2024-11-13 13:13:14 +00:00
97d995a0d3 Revert "[pytorch/profiler] Profiler NCCL metadata can now contain collective Input and Ouput Tensor addrs (#139837)"
This reverts commit 3e277eb9febbbdd435e6a07a3f0750d4e362625a.

Reverted https://github.com/pytorch/pytorch/pull/139837 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139837#issuecomment-2473466607))
2024-11-13 12:26:43 +00:00
ba136a78ba [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-13 12:17:19 +00:00
e754611d19 [aoti] Add error msg if we can't find a proxy executor (#140308)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140308
Approved by: https://github.com/desertfire
2024-11-13 09:10:54 +00:00
c61ccaf10e [FR] Polish the log message for dtype mismatch and don't exit when too many mismatch (#140451)
Summary:
1. We don't want to exit with an exception when there are too many mismatches. We should just break and return.
2. Polish the message for dtype mismatch. The dtype of input/output is actually a list, not a string, so we don't want to show a list like ['double'] in the output message.

Test Plan:
Testing on the case when we see too many collective dtype mismatch

 {F1958467224}

Differential Revision: D65841830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140451
Approved by: https://github.com/c-p-i-o
2024-11-13 07:24:53 +00:00
cb71bcc542 Replace clone.detach with detach.clone (#140264)
Fixes #64532

As stated in the issue, replace `clone.detach` with `detach.clone`.
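
For illustration, a minimal example of the preferred order (not taken from the changed files):

```python
# Detaching first creates the copy outside of autograd, so no gradient history
# is recorded for the intermediate, unlike clone().detach().
import torch

x = torch.randn(3, requires_grad=True)
y = x.detach().clone()
assert not y.requires_grad
```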

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
2024-11-13 07:01:02 +00:00
f06ee3e546 [pt2] Add meta for _add_relu (#140009)
aten._add_relu doesn't have a meta function registered, so in the dynamic shape case it throws an error in the dynamo logs:
Error:
`V1107 11:25:32.344000 140481543555072 torch/_dynamo/symbolic_convert.py:534] [0/1] [__graph_breaks] NotImplementedError: aten::_add_relu.Tensor: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add a fake impl.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140009
Approved by: https://github.com/ezyang
2024-11-13 06:30:58 +00:00
8a80cee2f3 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [3/N] (#140247)
related commits:

- #139706
- #140238
- #140247
- #140253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140247
Approved by: https://github.com/soulitzer
2024-11-13 05:51:42 +00:00
5b1c67cc60 [Intel GPU] Avoid atomic add for XPU device in satter_add by deterministic mode (#137966)
The "scatter_add" op with the deterministic mode in XPU device is not implemented, it will report that "scatter_add_kernel" does not have a deterministic implementation in UT.

Just like the CUDA implementation, we need to check _deterministic_algorithms in the scatter_add op for the XPU device.

The UT is in: https://github.com/intel/torch-xpu-ops/blob/main/test/xpu/test_scatter_gather_ops_xpu.py. We reused [PyTorch UT code]( 96b30dcb25/test/test_scatter_gather_ops.py (L233)).
Now the UT case is [skipped in torch-xpu-ops test](4fa7921f1e/test/xpu/skip_list_common.py (L731)). We will enable it once this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137966
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
2024-11-13 05:46:54 +00:00
79fb7416e7 [Intel GPU] Add device guard for XPU structured operator in torchgen (#138802)
This PR is a supplement to https://github.com/pytorch/pytorch/pull/133980. The previous PR fulfills the basic functionality of the XPU device guard, but we found it fails to handle structured operators.

With the current PR, the code snippet in RegisterXPU.cpp is as follows, where we can see the device guard is successfully generated.

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
    c10::OptionalDeviceGuard guard_;
};

```

However, without the current change, the generated code is:

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
};
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138802
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
2024-11-13 05:40:38 +00:00
7b0d199471 [doc] fix grammar in "Extending Torch" (#140209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140209
Approved by: https://github.com/soulitzer
2024-11-13 05:34:43 +00:00
1886e33f60 Use device-agnostic runtime API in distributed DDP/FSDP instead of cuda device specific. (#137678)
# Motivation
This PR aims to use device-agnostic runtime APIs in distributed DDP/FSDP instead of `cuda`-specific ones.

cc [@jgong5](https://github.com/jgong5) [@gujinghui](https://github.com/gujinghui) [@EikanWang](https://github.com/EikanWang) [@fengyuan14](https://github.com/fengyuan14) [@guangyey](https://github.com/guangyey)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137678
Approved by: https://github.com/kwen2501, https://github.com/guangyey, https://github.com/jgong5
2024-11-13 05:32:19 +00:00
4c6eebf4e2 [doc] improve code in fake tensor doc (#140329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140329
Approved by: https://github.com/soulitzer
2024-11-13 05:14:56 +00:00
d6b3ad4de2 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [2/N] (#140238)
related commits:

- #139706
- #140238
- #140247
- #140253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140238
Approved by: https://github.com/soulitzer
2024-11-13 05:13:39 +00:00
42ad54c71b [Intel GPU] Allow XPU device in LSTMCell operators (#140246)
Refine device check logic for LSTMCell.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140246
Approved by: https://github.com/soulitzer
2024-11-13 05:13:07 +00:00
3e277eb9fe [pytorch/profiler] Profiler NCCL metadata can now contain collective Input and Ouput Tensor addrs (#139837)
Studying memory access patterns is the primary use case.

Internal: The data may be used to find the % of operators that may cause alignment related overhead.

Differential Revision: [D64413699](https://our.internmc.facebook.com/intern/diff/D64413699/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139837
Approved by: https://github.com/sraikund16
2024-11-13 04:57:16 +00:00
4bbd6da331 Enable XPUEvent elapsed_time function (#134666)
# Motivation
This PR aims to enable `elapsed_time` function for `XPUEvent`.

# Additional Context
This PR depends on toolchain oneAPI 2025.0.
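
A minimal usage sketch (assumed API mirroring CUDA events; requires an XPU build with oneAPI 2025.0):

```python
# Time a small op with XPU events; elapsed_time is the method enabled here.
import torch

if torch.xpu.is_available():
    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    start.record()
    torch.ones(1024, device="xpu").mul_(2)
    end.record()
    torch.xpu.synchronize()
    print(start.elapsed_time(end))  # elapsed time in milliseconds
```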

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134666
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-11-13 04:32:50 +00:00
e9fb2c6abe Add some error messages for flexattention (#138891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138891
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-11-13 04:05:29 +00:00
659d2132be Add architecture to XPU device property (#138186)
# Motivation
Add `architecture` to XPU device property.
In some cases, low-level application code can use special features or do specific optimizations depending on the device architecture, and this PR enables such applications.
Modified from https://github.com/pytorch/pytorch/pull/129675/files
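
A minimal usage sketch (assumed; the field name follows the title of this change):

```python
# Read the new `architecture` field from the XPU device properties.
import torch

if torch.xpu.is_available():
    props = torch.xpu.get_device_properties(0)
    print(props.architecture)  # architecture id of XPU device 0
```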

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138186
Approved by: https://github.com/ezyang
2024-11-13 03:35:13 +00:00
39d1c91c33 [dynamo] Restrict support for out= variants of torch operators (#140202)
There has been a series of attempts to provide support for resizing in
torch operators like `torch.sigmoid(x, out=y)`, i.e., `y` would have a
different shape before and after this expression. Prior to this patch,
we have some checks to graph break if the shape changed.

This patch
1. extends the existing check and graph break for any shape change, not
   just for `TensorVariable`s with a source field.
2. removes an old code path which was introduced to address the shape
   change, but became obsolete in that regard because we added extra
   checks to graph break upon shape change. Moreover, this old code path
   is unsound, it tries to replace references to the old
   `TensorVariable` the new one returned by `wrap_fx_proxy`, but it only
   does the replacement in `symbolic_locals`, which breaks when cells
   are involved. In general the old `TensorVariable` could be _anywhere_,
   think the `replace_all` we had for immutable VTs.
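
A hypothetical illustration of the pattern being guarded against (assumed, not from the test suite):

```python
# The out= tensor has a different shape before and after the call: eager
# resizes it, and dynamo now graph-breaks on such a shape change.
import torch

x = torch.randn(4)
y = torch.empty(2)         # deliberately the wrong shape
torch.sigmoid(x, out=y)    # eager resizes y from (2,) to (4,)
print(y.shape)             # torch.Size([4])
```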

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140202
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151, #140201
2024-11-13 03:14:23 +00:00
65615915ed [dynamo] Fix bugs in side-effect pruning and codegen (#140201)
This patch fixes 2 things which are exposed if we have `NewCellVariable`
rather than `ClosureVariable` to model python cells:
1. `codegen_save_tempvars` must run first, to establish `source` for
   objects, otherwise they can't reconstruct.
2. `prune_dead_object_new` must account for `OutputGraph.backward_state`
   as well, since it also contains variables that must live.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140201
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151
2024-11-13 03:14:23 +00:00
3a622c5685 [dynamo] Refine LocalSource.cell_or_freevar to LocalSource.is_input (#140151)
The `cell_or_freevar` was added in #106403 to help us ensure
Dynamo-export only allows graph input that depends on the frame input
(rather than a captured cell, for instance).

However, when taken literally, the `cell_or_freevar` condition is
actually not accurate, because for frame inputs that are also cells
(i.e., captured by some inner function), we actually set the
`cell_or_freevar` flag to false. This makes sense, because otherwise the
existing implementation would prevent Dynamo-export from adding any of these
inputs to the graph.

To help with reasoning, this patch refines the `cell_or_freevar` flag to
what we really want to check -- `is_input`, and updates the relevant use
sites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140151
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150
2024-11-13 03:14:23 +00:00
d34d5ccec5 [dynamo] Fix some corner cases for modeling pre-existing cells (#140150)
In `UserFunctionVariable.bind_args`, there's a rare case when the
underlying function satisfies all conditions below
1. The function captures a pre-existing cell
2. The cell isn't captured by root frame
3. `UserFunctionVariable.source` is `None`

In such cases, Dynamo would model the cell as its content (just like
what we do for cells in the root frame). However, this could break in
two cases:
- We could have multiple instances of `UserFunctionVariable`, where some
  have source and others don't. This means sometimes we'll model the
  cell as a `NewCellVariable`, and sometimes as its content. This
  causes issues because writes to the `NewCellVariable` would be
  buffered in `SideEffects` and never get picked up by the other
  modeling.
- Only when `UserFunctionVariable` has a source, do we check whether we
  already had a `NewCellVariable` for the captured cell. This again causes
  Dynamo to potentially have multiple representations for the same cell
  object, resulting in a similar "buffered writes not reflected" issue
  as above.

This patch fixes the above 2 issues by
1. modeling captured cells of sourceless `UserFunctionVariable` as
   immutable `NewCellVariable`, and adds a few lines in `SideEffects` to
   account for its immutability.
2. always checking whether we already had a `NewCellVariable` for the
   captured cell, before constructing a new one.

Tests are added for each aforementioned case.

I also left a TODO to investigate why exactly we would lose source
information for `UserFunctionVariable`. Some cases are easily fixable,
but others not so much.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140150
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149
2024-11-13 03:14:23 +00:00
6a821c9e6a [dynamo] Remove cell unboxing/restart optimization (#140149)
We added an unboxing optimization to avoid writes to cells that existed
before Dynamo tracing (such writes interfere with HOPs). However, the
avoided write shouldn't be there in the first place, since we were
basically creating an empty `NewCellVariable`, and then write the
pre-existing content into the variable.

This patch
1. adds logic to bypass the initial write for pre-existing cells
   without undermining correctness.
2. removes the unboxing optimization and the restart code path.

Fixes #137456, #138491; also see those issues for more historical
context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140149
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #140035, #140036
2024-11-13 03:14:23 +00:00
698ff07323 [dynamo] Fix name collision bug for captured cells and locals (#140036)
The `export_freevars` method was introduced very early on, for
propagating writes to unboxed cells from child to parent frame, see
https://github.com/pytorch/torchdynamo/commit/d0c10341.

However, it's no longer needed after we started to modify root tracer's
`symbolic_locals` directly for the unboxed cells, see
https://github.com/pytorch/torchdynamo/commit/663e4d92.

As a result, we no longer need `export_freevars`. In fact, it can cause
a very subtle bug when name collision happens across the parent and
child frames during inlining, because the parent frame isn't necessarily
the frame that defined the cell captured by child frame.

In summary, this patch removes the `export_freevars` bits, and adds a
regression test.
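
A hypothetical shape of the problematic scenario (assumed, not the actual regression test): the frame being inlined into is not the frame that defined the captured cell, yet both use the same name.

```python
def make_counter():
    x = 0
    def inc():
        nonlocal x   # `x` is a cell defined in make_counter's frame
        x += 1
        return x
    return inc

def parent(x):           # this local `x` is unrelated to the captured cell `x`
    inc = make_counter()
    return inc() + x     # inlining inc() must not leak its `x` into parent's locals
```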

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140036
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #140035
2024-11-13 03:14:23 +00:00
8dc3cb043c [dynamo] Put cells into closure_cells and document relevant parts (#140035)
This patch establishes the invariant that `ClosureVariable` and
`NewCellVariable` are always in `closure_cells`, never in
`symbolic_locals`, and therefore removes some duplicated code paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140035
Approved by: https://github.com/jansel
2024-11-13 03:14:23 +00:00
d3da6d49df Add cmake to requirements.txt (#140491)
As one cannot build PyTorch in a clean venv if cmake is not installed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140491
Approved by: https://github.com/yangw-dev, https://github.com/huydhn
2024-11-13 02:53:25 +00:00
953286b850 [DTensorTestbase] Fix @with_comms inactive problem (#139637)
Summary:
`with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem with a decorator that takes input arguments is that it always has to be invoked, i.e., you have to write `with_comms()` rather than `with_comms`, which is what the majority of existing usages do.

This diff tries to provide a solution such that we could use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`.
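
A minimal sketch of the decorator pattern that allows both spellings (assumed, not the actual DTensorTestBase implementation):

```python
import functools

def with_comms(func=None, *, eager_init=False):
    def decorator(f):
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            # set up / tear down the process group here, honoring eager_init
            return f(self, *args, **kwargs)
        return wrapper
    if func is None:
        return decorator      # used as @with_comms(...) with arguments
    return decorator(func)    # used as a bare @with_comms
```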

Test Plan: Contbuild & OSS CI

Differential Revision: D65385700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637
Approved by: https://github.com/wz337
2024-11-13 02:45:02 +00:00
cyy
40fb738197 Use Wextra-semi (#140236)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140236
Approved by: https://github.com/ezyang
2024-11-13 02:15:16 +00:00
fb7148d05d Fix split decomp returning self (#140065)
Previously the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:

> RuntimeError: View operation returned a tensor that is the same as the input base tensor.  This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.

Fix for https://github.com/pytorch/pytorch/issues/133394
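
A hypothetical trigger for the no-split case (assumed):

```python
import torch

@torch.compile
def f(x):
    return torch.split(x, 4)   # x has exactly 4 elements, so there is one "split"

f(torch.randn(4))  # previously hit the "view returned the same tensor as the input base" error
```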

Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
2024-11-13 01:58:02 +00:00
4906413b70 [Intel GPU] Support RegisterSparseXPU.cpp codegen. (#139267)
This PR is to support code generation for sparse operations on Intel GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139267
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-11-13 01:41:43 +00:00
891ba2ec8a Fix xpu cmake typo (#140374)
# Motivation
This PR aims to fix a typo in the CMake build. The typo impacts the XPU Windows build and results in PyTorch being built without XPU, which is unexpected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140374
Approved by: https://github.com/EikanWang, https://github.com/ezyang, https://github.com/atalman
2024-11-13 00:26:35 +00:00
3d2dd14217 [BE][Bugfix]: Add rad2deg to pointwise ops (#140290)
Adds missing pointwise tags. Apparently this allows NestedTensor to properly generate a function for OpInfo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140290
Approved by: https://github.com/jbschlosser
2024-11-13 00:02:00 +00:00
3e82b1f6c0 Build magma tarball for cuda 126 (#140143)
Now that manylinux 2.28 is available with CUDA 12.6 https://github.com/pytorch/pytorch/pull/139909

we can build the magma tarball for CUDA 12.6.

Fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140143
Approved by: https://github.com/atalman
2024-11-12 23:42:26 +00:00
d48ea29b9a Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit 8c6abe5a8c42be3909496d2cd3d1f194a8493460.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2471847439))
2024-11-12 23:35:27 +00:00
1f28235ee2 Allow NJT by default for weights_only torch.load (#140304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140304
Approved by: https://github.com/jbschlosser
2024-11-12 23:34:27 +00:00
096929c1e8 Add safe.directory to Almalinux docker image (#140454)
Something that was accidentally dropped by: https://github.com/pytorch/pytorch/pull/140157
Needs to be re-added. I believe it's part of our Docker images. Please see: https://github.com/pytorch/pytorch/blob/main/.ci/docker/manywheel/Dockerfile#L21

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140454
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-11-12 23:28:12 +00:00
70a223cce6 [aotinductor] fix a few issues in bandwidth profiler (#139607)
Summary:
Recent attempts with the bandwidth profiler did not behave as expected. I have observed a few issues and tried to fix them in this diff:
1. The return value of the DebugAutotuner class.
2. Profiling results show really large overhead.
DebugAutotuner.run() returns a benchmark time of around 45ms while CachingAutotuner.run() returns around 0.45ms.
The `_find_names` and `re.match` calls take 45ms: P1669186358
After commenting out the above _find_names and re.match, the benchmark time becomes consistent with the non-profiling mode: P1669185589
3. Introduce a variable `bandwidth_info` to control the path in DebugAutotuner.run(). During benchmarking for configuration selection, we should turn off `bandwidth_info`.

After applying this diff, the profiling issues mentioned above are fixed: P1669273172

Test Plan:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1   TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/tmp/profile.txt TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=5  buck run mode/{opt,inplace} scripts/wwei6/triton_examples:test_mat 2>&1 | tee profiling-5.log
```
If we want to disable the Aten backend, just add TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON"

Differential Revision: D64883079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139607
Approved by: https://github.com/chenyang78
2024-11-12 23:26:47 +00:00
267641f6f1 [Profiler] Add More Logging for Dynamic Collection API (#140285)
Summary: Add a log that warns users that disabling only CUDA events can cause incorrect correlation IDs

Test Plan: Log was printed in the correct scenario

Differential Revision: D65762576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140285
Approved by: https://github.com/sanrise
2024-11-12 22:59:04 +00:00
7578a0b268 [pipelining] clean up stage functions (#140418)
Clean up methods related to stage input/output shape verification which are no longer needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140418
Approved by: https://github.com/wconstab
ghstack dependencies: #140019
2024-11-12 21:42:08 +00:00
2ac71a5771 [pipelining] add type checking to _backward functions (#140019)
fix https://github.com/pytorch/pytorch/issues/139405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140019
Approved by: https://github.com/wconstab
2024-11-12 21:42:08 +00:00
1f590feaf7 [AOTI][refactor] Update codegen_int_array_var API (#140299)
Summary: codegen_int_array_var and codegen_reinterpret_view need to call different writeline functions depending on which part of the code they are writing. Previously their APIs took a writer and implicitly assigned a default writer if needed, which is not intuitive. Update their APIs to explicitly take a writeline function.

Differential Revision: [D65774584](https://our.internmc.facebook.com/intern/diff/D65774584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140299
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2024-11-12 21:39:41 +00:00
8c6abe5a8c [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-12 21:36:24 +00:00
0db21a6b23 Remove most rockset references (#139922)
Remove most references to rockset:
* replace comments and docs with a generic "backend database"
* Delete `upload_to_rockset`, so we no longer need to install the package.
* Do not upload perf stats to rockset as well (we should be completely on DynamoDB now right @huydhn?)

According to VSCode, it went from 41 -> 7 instances of "rockset" in the repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139922
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-11-12 21:17:43 +00:00
4675875d16 Fix lint after #138899 (#140446)
Fixes Lint after: https://github.com/pytorch/pytorch/pull/138899
Due to landrace.
Run ``./regenerate.sh``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140446
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-11-12 20:53:58 +00:00
1172a10574 [Build] Do not regenerate code endlessly without XPU (#140438)
Before this change, if one builds PyTorch without XPU, the build process will
perpetually regenerate code because of a reference to a non-existent file,
which makes the autograd codegened files always out of date; see part of the `ninja -d explain torch_cpu` output:
```
ninja explain: output ../torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp doesn't exist
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/csrc/autograd/generated/Functions.cpp is dirty
```

This is a regression introduced by https://github.com/pytorch/pytorch/pull/139025.

After this change, incremental rebuilds with no changes cause no build actions:
```
% ninja -j1 -v -d explain -n torch_cpu
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja: no work to do.
```

Test plan: Wait for at least one XPU build to finish...

Fixes https://github.com/pytorch/pytorch/issues/140432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140438
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-11-12 20:19:28 +00:00
14bb49fe98 Add CUDA 12.6 Linux Builds to Binaries Matrix (#138899)
Related to #138440

Issue tracker: https://github.com/pytorch/pytorch/issues/138609

Version based on https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138899
Approved by: https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-12 19:52:31 +00:00
034b105d53 [BE][Ez]: Add NT unary op macro (#140213)
* Adds a macro to simplify adding more unary ops to NT.
* Adds sqrt support to NT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140213
Approved by: https://github.com/jbschlosser
2024-11-12 19:50:06 +00:00
069a71023b Revert "[inductor] Refactor reduction type choices into V.choices (#139585)"
This reverts commit 6438c8637a7e28b676a1ccfe942dc37375d0cb14.

Reverted https://github.com/pytorch/pytorch/pull/139585 on behalf of https://github.com/kit1980 due to breaking internal builds, see D65800124 ([comment](https://github.com/pytorch/pytorch/pull/139585#issuecomment-2471392822))
2024-11-12 19:32:14 +00:00
c0ddd10f6d Revert "[inductor] Support fixed triton configs defined at compile time (#140217)"
This reverts commit 29114e44fa7a17a3a2112d76937ae3b4cf9d33ce.

Reverted https://github.com/pytorch/pytorch/pull/140217 on behalf of https://github.com/kit1980 due to breaking internal builds, see D65800124 ([comment](https://github.com/pytorch/pytorch/pull/139585#issuecomment-2471392822))
2024-11-12 19:32:14 +00:00
8304a1faad OpenReg: Fix issue when casting tensor on the executor size (#140255)
Previously we assumed that the number of tensor elements multiplied by the type size is not greater than the allocated memory size. However in some scenarios such as `tensor.expand`, the stride can be zero, which makes the assumption not true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140255
Approved by: https://github.com/ezyang
2024-11-12 19:29:21 +00:00
cc8e832066 [AMD] use DC method for linalg.eigh (#140327)
Summary: The Jacobi method has larger numerical errors (see D64997718), so use the divide-and-conquer method instead.

Test Plan: CI

Differential Revision: D65786796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140327
Approved by: https://github.com/jianyuh
2024-11-12 19:17:25 +00:00
726424f4de Use base32 triton cache function if base64 is not found (#140297)
In #140190 the base64 function is imported from triton.

But since triton-lang/triton#5088,
the base64 function was replaced with base32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140297
Approved by: https://github.com/davidberard98
2024-11-12 19:05:21 +00:00
c182c7ccfc Fix triangular_solve meta function out parameter names. (#140186)
This PR replaces the parameter names specified in the `triangular_solve_meta`
function (specifically in its `@out_wrapper(...)` decorator) by those written in the
_native_functions.yaml_ file.

This name mismatch caused the operation to fail when using the meta device (see error
below):

```python
Traceback (most recent call last):
  File "examples/test.py", line 23, in <module>
    torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
  File "torch/_decomp/__init__.py", line 100, in _fn
    return f(*args, **kwargs, out=None if is_none else out_kwargs)
  File "torch/_prims_common/wrappers.py", line 289, in _fn
    result = fn(*args, **kwargs)
TypeError: triangular_solve_meta() got an unexpected keyword argument 'X'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140186
Approved by: https://github.com/ezyang
2024-11-12 19:04:34 +00:00
6a368b3fc5 Add ScalarList overload to _foreach_lerp (#134482)
Related:
- https://github.com/pytorch/pytorch/issues/133367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482
Approved by: https://github.com/janeyx99
2024-11-12 19:03:41 +00:00
cyy
7624d625c0 [Reland][7/N] Fix Wextra-semi warning (#140342)
Reland of #140225 to fix a change in FBCODE_CAFFE2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140342
Approved by: https://github.com/kit1980
2024-11-12 18:55:31 +00:00
e4195f8060 Revert "[logging][ez] Add timer logging for pickling and unpickle for object based collective (#139757)"
This reverts commit 41e4d88584c4ed0708cd1d93c71cd4ee2e1bbbb5.

Reverted https://github.com/pytorch/pytorch/pull/139757 on behalf of https://github.com/izaitsevfb due to reverted internally, see D65682470 ([comment](https://github.com/pytorch/pytorch/pull/139757#issuecomment-2471316405))
2024-11-12 18:53:37 +00:00
cyy
a3cff4bbd4 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-11-12 18:49:51 +00:00
928b8ec633 [BE]: Add pointwise tag to isfinite (#140291)
Adds pointwise tag to isfinite
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140291
Approved by: https://github.com/jbschlosser
2024-11-12 18:02:07 +00:00
5aadaaf2b5 [Dynamo] Allow filter() to handle infinite iterator (#138305)
Fixes #137380

```python
import itertools

import torch

def filt(x):
    return x < 10

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    x = x + 1
    return zip(range(3), filter(filt, itertools.count()))

print(list(f(torch.ones(3)))) # [(0, 0), (1, 1), (2, 2)]

@torch.compile(backend="eager")
def g(x):
    x = x + 1
    return filter(filt, [1, 2, 3])

res = g(torch.ones(3))
assert isinstance(res, filter)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138305
Approved by: https://github.com/williamwen42
2024-11-12 17:32:56 +00:00
7a02457053 [BE] Fix error message in torch._scaled_mm (#140343)
Followup after https://github.com/pytorch/pytorch/pull/140307 that fixes error message for mat1, but not for mat2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140343
Approved by: https://github.com/kit1980
2024-11-12 17:13:41 +00:00
60db702a42 Noop m.set_python_module on C10_MOBILE builds (#140273)
Summary:
This was causing issues. Since Python isn't available on C10_MOBILE anyways,
it's OK to noop the call to m.set_python_module. We no-op it by just never
calling registerPythonModule.

This is a fix only for C10_MOBILE; there's likely a corresponding issue for
regular PyTorch that we need to work through
(https://github.com/pytorch/pytorch/issues/140272)

Test Plan: - tests

Differential Revision: D65758016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140273
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-11-12 16:35:01 +00:00
d723abf686 [CI]Move CPU inductor test runners and cases to save cost (#136313)
For CPU, only SPR has native support for AMP BF16.

Ref: pytorch/pytorch#138476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136313
Approved by: https://github.com/jgong5, https://github.com/zxiiro, https://github.com/chuanqi129, https://github.com/desertfire
2024-11-12 16:15:20 +00:00
faef1510f8 Add batch rule for native_dropout_backward (#140140)
Fixes: #122432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140140
Approved by: https://github.com/zou3519
2024-11-12 16:14:49 +00:00
213b8ef163 [BE] add empty tensor testing for _foreach_addcmul/div (#140276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140276
Approved by: https://github.com/jbschlosser
ghstack dependencies: #140191
2024-11-12 15:35:06 +00:00
92fb1f79b8 [BE] Test interspersed empty tensors for _foreach_norm test parity (#140191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140191
Approved by: https://github.com/jbschlosser
2024-11-12 15:35:06 +00:00
71d8bb7ede implement torch._foreach_rsqrt (#134574)
Related:
- #133367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-11-12 15:34:35 +00:00
8cb0b932a1 Fix broken AOTInductor node and kernel counts (#139435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139435
Approved by: https://github.com/desertfire
ghstack dependencies: #139411, #139412
2024-11-12 15:22:46 +00:00
fef16fe254 Enable all fixed cpp_wrapper tests (#139412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139412
Approved by: https://github.com/desertfire
ghstack dependencies: #139411
2024-11-12 15:22:46 +00:00
761b42bc08 cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)
Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use.

Closes #135559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411
Approved by: https://github.com/desertfire
2024-11-12 15:22:38 +00:00
057f0dca78 Don't use sudo to checkout sources (#140263)
Move this part out of https://github.com/pytorch/pytorch/pull/125401 and try using it for all architectures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140263
Approved by: https://github.com/zxiiro, https://github.com/huydhn
2024-11-12 14:29:17 +00:00
78a8f7f5c3 [FSDP2] Fix CUDA sync for bf16 HSDP AR, fp32 params (#140044)
Differential Revision: [D65621037](https://our.internmc.facebook.com/intern/diff/D65621037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140044
Approved by: https://github.com/weifengpy
2024-11-12 13:31:40 +00:00
51e8a13d00 CD Enable Python 3.13 on windows (#138095)
Adding CD for Windows. Part of: https://github.com/pytorch/pytorch/issues/130249
Builder PR landed with smoke test: https://github.com/pytorch/builder/pull/2035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138095
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-12 12:28:10 +00:00
ff91fcc991 Refactor device index bound check for xpu code (#120768)
# Motivation
Referring to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](https://github.com/pytorch/pytorch/pull/119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch.

# Solution
Use `TORCH_CHECK` to check if the number of XPU devices exceeds `c10::Device::MAX_NUM_DEVICES` when enumerating XPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120768
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
2024-11-12 12:09:11 +00:00
f77eb07662 Split int4wo weight packing (#139611)
Fixes https://github.com/pytorch/ao/issues/1117.

This PR separates int4wo weight packing between CPU and other devices, to help implement `INT4CPULayout` in torchao based on https://github.com/pytorch/ao/issues/1117#issuecomment-2451252756.

Now, for CPU, the input `weight` of `_convert_weight_to_int4pack_for_cpu` is [n, k] int32, output is [n, k / 2] uint8. The input packed weight of `_weight_int4pack_mm_for_cpu` is [n, k / 2] uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139611
Approved by: https://github.com/jerryzh168
2024-11-12 10:12:50 +00:00
7691064768 dispatcher module for multiple graphs (#139439)
Differential Revision: [D65307961](https://our.internmc.facebook.com/intern/diff/D65307961/)

This PR introduces the concept of a "dispatcher" module `n` that carries multiple interpreter modules `n`, `n@1`, `n@2`, etc., each corresponding to a particular call of `n` and thus might carry a different specialized graph. We only do this when we're preserving module call signatures for `n`. The carried modules have the same number and order of calls to `n` appearing in the original module / exported program. In the unflattened module, all those calls go to the "dispatcher" module which internally tracks how many calls have been made so far and invokes the corresponding interpreter module. We reset this tracking after a successful or unsuccessful run of the unflattened module.

Overall this makes swapping easier when module call signatures are preserved.
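
A minimal sketch of the dispatching idea (assumed, not the actual unflattener code):

```python
import torch

class CallDispatcher(torch.nn.Module):
    def __init__(self, variants):
        super().__init__()
        # variants[i] is the interpreter module specialized for the i-th call
        self.variants = torch.nn.ModuleList(variants)
        self._calls = 0

    def forward(self, *args, **kwargs):
        mod = self.variants[self._calls]
        self._calls += 1
        if self._calls == len(self.variants):
            self._calls = 0   # reset once every recorded call has been served
        return mod(*args, **kwargs)
```
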
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139439
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #139438
2024-11-12 09:53:40 +00:00
9a5175e836 fix shared submodule module call signature (#139438)
Differential Revision: [D65308061](https://our.internmc.facebook.com/intern/diff/D65308061/)

When a shared submodule is called multiple times with different aliases, e.g., `self.a` and `self.b` are both `C()` under the hood and we have calls to both `self.a(...)` and `self.b(...)`, we wrap `C()` to emit as many export tracepoints as there are aliases. This caused us to compute module call signatures that conflated information: we'd add inputs and outputs of one call to inputs and outputs of a different call. Overall preserving module call signatures in the presence of shared submodules was borked because of this bug.

The fix is to pay attention to the nn module stack, which accurately tracks individual calls, thus allowing us to ignore some export tracepoints that get the module correct but not the alias through which the call was made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139438
Approved by: https://github.com/zhxchen17
2024-11-12 09:53:40 +00:00
a104b560d8 fix trace nn.parameters() (#138149)
Fixes #137764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138149
Approved by: https://github.com/anijain2305
2024-11-12 09:43:45 +00:00
330c9577a3 [Inductor] make decompose_mm_pass support cpu case (#139696)
Summary: Previously, decompose_mm_pass only worked for the GPU case. This diff makes it support some CPU cases as well for performance optimization.

Differential Revision: D65226131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139696
Approved by: https://github.com/eellison
2024-11-12 06:22:23 +00:00
965555d1fd [dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.

Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars) # ('__class__',)
print(Foo.__init__.__closure__)          # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```

This patch also exposed and fixes a bug in
`NNModuleVariable.var_getattr`, where Dynamo wasn't propagating source
correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
2024-11-12 05:54:35 +00:00
09bab7566a Revert "Allow NJT by default for weights_only torch.load (#140304)"
This reverts commit 455dc4c14264a0cd7d70ba5328382a9fb7769094.

Reverted https://github.com/pytorch/pytorch/pull/140304 on behalf of https://github.com/huydhn due to A bunch of failure shows up in trunk after this lands, so probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/140304#issuecomment-2469602096))
2024-11-12 04:53:10 +00:00
469eae2ba2 [inductor][invoke_subgraph] Fix SDPA seed/offset issue (#140070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140070
Approved by: https://github.com/eellison
2024-11-12 04:40:03 +00:00
23db92bad2 [FR] refactor build collective and return more info to db (#140082) (#140303)
Summary:

This change is trying to return the result of analysis with more details. Internally the contract is listed in https://docs.google.com/document/d/19ON5jKlYirT76D4Q-OoGMgD-U2L_sCDnUd_RE1gfiLE/edit?tab=t.0. For OSS, this change is BC to the current behavior.

Also create a new state object which handles logging and converts the object to Collective and NCCLCall.

Test Plan: CI and more thorough testing is on the way.

Reviewed By: VieEeEw

Differential Revision: D65612448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140303
Approved by: https://github.com/c-p-i-o
2024-11-12 03:43:02 +00:00
455dc4c142 Allow NJT by default for weights_only torch.load (#140304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140304
Approved by: https://github.com/jbschlosser
2024-11-12 02:04:18 +00:00
19eff28ff3 [Intel GPU] Extract common utils for conv&qconv (#139580)
# Motivation
This PR is a precursor to #133080. It extracts common logic in convolution and quantized convolution into `Utils.cpp`. With this modification, these two operators can share code such as input format querying and op layout querying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139580
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/malfet
ghstack dependencies: #139721
2024-11-12 02:00:33 +00:00
e21ee6327d [Intel GPU] format XPU oneDNN integration codes (#139721)
# Motivation
This PR adds the XPU oneDNN integration code to the lintrunner config `.lintrunner.toml`, which formats the cpp sources and headers at `aten/src/ATen/native/mkldnn/xpu/` and `aten/src/ATen/native/mkldnn/xpu/detail/`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139721
Approved by: https://github.com/guangyey, https://github.com/cyyever, https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/malfet
2024-11-12 01:52:06 +00:00
4e487eda7a Add linters for C10_UNUSED and C10_NODISCARD (#140302)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140302
Approved by: https://github.com/Skylion007
2024-11-12 01:50:11 +00:00
263a5bf95e [cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)
Reopen https://github.com/pytorch/pytorch/pull/121782, as more optimizations have landed.

Fixes https://github.com/pytorch/pytorch/issues/115261, https://github.com/pytorch/pytorch/issues/113017.
For CPU inductor path, remove -ftree-loop-vectorize from optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32
![image](https://github.com/user-attachments/assets/ec920928-fa36-467f-ba07-d2c05c51b92e)

Outlier models (speedup<0.8, single socket): None.

#### BF16
![image](https://github.com/user-attachments/assets/4a301e5e-147d-4b74-beb1-40290969ed80)

Outlier models (speedup<0.8, single socket multi threads):

- functorch_dp_cifar10 0.58
- opacus_cifar10 0.57

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136827
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-11-12 01:26:18 +00:00
29114e44fa [inductor] Support fixed triton configs defined at compile time (#140217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140217
Approved by: https://github.com/shunting314
ghstack dependencies: #139585
2024-11-12 00:56:02 +00:00
6438c8637a [inductor] Refactor reduction type choices into V.choices (#139585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139585
Approved by: https://github.com/shunting314
2024-11-12 00:56:02 +00:00
e76f57d54e add missing bracket in error message (#140307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140307
Approved by: https://github.com/kit1980
2024-11-12 00:45:14 +00:00
dbb55b448b Revert "[7/N] Fix Wextra-semi warning (#140225)"
This reverts commit ffb979032dc149b4c895526fe5b92d713ed7b1e1.

Reverted https://github.com/pytorch/pytorch/pull/140225 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/140225#issuecomment-2469312229))
2024-11-12 00:02:06 +00:00
0af38b1034 Remove temp table to post autograd IR (#140085)
This table is not needed

Differential Revision: [D64553397](https://our.internmc.facebook.com/intern/diff/D64553397/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140085
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
2024-11-11 23:59:09 +00:00
c223e0642c Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-11 23:55:27 +00:00
a96aadf0a0 fix specialization logic in Scalar.h (#140280)
Fixes `test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_linalg_norm_subgradients_at_zero_cuda_float64` when `specialize_float=False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140280
Approved by: https://github.com/ezyang
2024-11-11 23:51:15 +00:00
222175b3d5 Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598)"
This reverts commit 2ede4c9a3858d6b97e2ba5156add0134b6765474.

Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/kit1980 due to breaking internal ExecuTorch tests ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2469294995))
2024-11-11 23:42:51 +00:00
412df50454 Revert "[dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)"
This reverts commit de40a23f6c02fd8d2b5046b5cab04582dc4ebc4e.

Reverted https://github.com/pytorch/pytorch/pull/140034 on behalf of https://github.com/kit1980 due to breaking internal tests, see D65755044 ([comment](https://github.com/pytorch/pytorch/pull/140034#issuecomment-2469290205))
2024-11-11 23:38:00 +00:00
2817fe8bef Add unaligned attributes to q8gemm/4x4c2-sse2.c (#140188)
Summary:
UBSan hits undefined behavior in this file. This fixes it by marking these pointers as unaligned.

```
caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5: runtime error: store to misaligned address 0x62900313891f for type 'uint32_t' (aka 'unsigned int'), which requires 4 byte alignment
0x62900313891f: note: pointer points here
 be be be be be  be be be be be be be be  be be be be be be be be  be be be be be be be be  be be be
             ^
UndefinedBehaviorSanitizer: undefined-behavior buck-caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5 in
```

The fix is to mark these variables as unaligned following D42179009's example

Test Plan: q8gemm.cc + internal integration test

Differential Revision: D65637959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140188
Approved by: https://github.com/digantdesai
2024-11-11 23:28:07 +00:00
5eb1ccadc2 [dynamo][user-defined] Walk __mro__ to get the member descriptor source (#140300)
Fixes https://github.com/pytorch/pytorch/issues/140266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140300
Approved by: https://github.com/williamwen42
2024-11-11 23:16:48 +00:00
a290c1d748 Fix building with system GLOO (#140275)
Leverage the existing FindGloo CMake module to locate the system's library and headers. Add the system's gloo headers to the include path rather than the gloo from third_party when USE_SYSTEM_GLOO is specified.

Fixes #140274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
2024-11-11 22:58:39 +00:00
b742d11b1c [TD] Filepath heuristic also looks at file name (#140170)
Filepath heuristic also now takes into account the file name, not just directories

A bit of refactoring
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140170
Approved by: https://github.com/huydhn
2024-11-11 22:55:54 +00:00
5f7ea7ca6a [invoke_subgraph] Support symint/int as inputs (#140058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140058
Approved by: https://github.com/ydwu4, https://github.com/eellison
ghstack dependencies: #139162
2024-11-11 22:26:43 +00:00
d4cdc09881 ILP for auto FSDP wrapping (#140298)
This PR presents a mixed integer linear programming (MILP) formulation that can be utilized to determine, under a memory budget, which modules to wrap as FSDP units. Similar to the auto SAC MILP introduced in https://github.com/pytorch/pytorch/pull/137908, the MILP uses information collected from MemTracker, Runtime Estimator, and SAC Estimator, introduced in these PRs:
* https://github.com/pytorch/pytorch/pull/124688
* https://github.com/pytorch/pytorch/pull/134243
* https://github.com/pytorch/pytorch/pull/135208

End-to-end example and its sample output:

```
import copy
from typing import Tuple

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

from torch.distributed._tools.ilp_utils import (
    aggregate_stats,
    get_peak_memory_runtime_baseline,
    parse_module_info,
)
from torch.distributed._tools.mem_tracker import _ModState, MemTracker
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.distributed._tools.fsdp_ilp import fsdp_milp, CommType, CommParams
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

def _init_model_input_optimizer() -> (
    Tuple[torch.nn.Module, torch.optim.Optimizer, torch.Tensor]
):
    bsz = 2
    model_args = ModelArgs(
        n_layers=6,
        n_heads=12,
        vocab_size=8192,
        max_seq_len=1024,
        dim=6144,
        dropout_p=0.1,
    )
    with torch.device(torch.cuda.current_device()):
        model = Transformer(model_args)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(
        0,
        model_args.vocab_size,
        (bsz, model_args.max_seq_len),
        device=torch.cuda.current_device(),
    )
    return (model, optimizer, inp)

def _run_and_get_mem_tracker(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> MemTracker:
    mem_tracker = MemTracker()
    mem_tracker.track_external(model, optimizer)
    with mem_tracker as mt:
        for iter_idx in range(2):  # running twice to initialize optimizer
            output = model(inp)
            output.sum().backward()
            if iter_idx == 1:
                last_snapshot = mt.get_tracker_snapshot("current")
            optimizer.step()
            optimizer.zero_grad()
            if iter_idx == 0:
                mt.reset_mod_stats()
    assert last_snapshot is not None
    for mod_stats in mem_tracker.memory_tracking.values():
        if _ModState.POST_BW not in mod_stats.snapshots.keys():
            mod_stats.snapshots.setdefault(_ModState.POST_BW, []).append(
                copy.deepcopy(last_snapshot)
            )
    return mem_tracker

def _run_and_get_runtime_estimator(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> RuntimeEstimator:
    def _run_one_step() -> None:
        output = model(inp)
        output.sum().backward()
        optimizer.step()
        optimizer.zero_grad()

    # Initializing optimizer states and warm-up
    _run_one_step()

    runtime_estimator = RuntimeEstimator()
    with runtime_estimator(estimate_mode_type="operator-level-cost-model"):
        _run_one_step()  # We use only one iteration for estimation
    return runtime_estimator

def _run_and_get_sac_estimator(
    model: torch.nn.Module,
    inp: torch.Tensor,
) -> SACEstimator:
    sac_estimator = SACEstimator()
    with sac_estimator(estimate_mode_type="operator-level-cost-model"):
        loss = model(inp).sum()
    loss.backward()
    return sac_estimator

def main():
    with FakeTensorMode():
        model, optimizer, inp = _init_model_input_optimizer()
        mem_tracker = _run_and_get_mem_tracker(model, optimizer, inp)
        runtime_estimator = _run_and_get_runtime_estimator(model, optimizer, inp)
        sac_estimator = _run_and_get_sac_estimator(model, inp)
        mod_info = aggregate_stats(
            model,
            mem_tracker,
            runtime_estimator,
            sac_estimator,
            torch.device(torch.cuda.current_device()),
        )
        g = parse_module_info(mod_info)

        peak_mem, compute_time = get_peak_memory_runtime_baseline(g)
        print("=== WITHOUT FSDP ===")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"compute_time: {round(compute_time, 2)} ms")

        fsdp_decisions, exposed_comm_time, peak_mem = fsdp_milp(
            g,
            world_size=8,
            memory_budget=15,
            comm_params={
                CommType.ALL_GATHER: CommParams(latency=0.01, bandwidth=2 * 1e8),
                CommType.REDUCE_SCATTER: CommParams(latency=0.01, bandwidth=2 * 1e8),
            },
        )
        print("=== WITH FSDP on 8 ranks ===")
        print(f"fsdp units: {sorted(fsdp_decisions)}")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"exposed communication time: {round(exposed_comm_time, 2)} ms")

if __name__ == "__main__":
    main()
```

```
=== WITHOUT FSDP ===
peak_mem: 20.92 GiB
compute_time: 1375.49 ms
=== WITH FSDP on 8 ranks ===
fsdp units: ['Transformer', 'Transformer.layers.0.attention.wk', 'Transformer.layers.0.attention.wo', 'Transformer.layers.0.attention.wq', 'Transformer.layers.0.attention.wv', 'Transformer.layers.0.feed_forward.w1', 'Transformer.layers.0.feed_forward.w2', 'Transformer.layers.1', 'Transformer.layers.2', 'Transformer.layers.3', 'Transformer.layers.4', 'Transformer.layers.5', 'Transformer.output', 'Transformer.pos_embeddings']
peak_mem: 13.63 GiB
exposed communication time: 1.02 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140298
Approved by: https://github.com/weifengpy
2024-11-11 22:02:39 +00:00
2c77352fe2 [AOTI][refactor] Clean up call chain in wrapper codegen (#136531)
Summary: For cpp wrapper, generate_kernel_call and define_kernel need to handle both cpu and gpu kernels. Refactor the code to remove nested super() calls.

Differential Revision: [D65639095](https://our.internmc.facebook.com/intern/diff/D65639095)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136531
Approved by: https://github.com/frank-wei
2024-11-11 22:00:42 +00:00
115c58c52a Update ET pin for #6744 (#140199)
This will be updated to an ET trunk commit after https://github.com/pytorch/executorch/pull/6744 lands. I also moved ET back from unstable and installed the llama3 dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140199
Approved by: https://github.com/kit1980
2024-11-11 21:40:12 +00:00
780b28f67e [ONNX] Update docstring typo in building (#140281)
The oprecorder docstring mistakenly referred to torchscript when it should say ONNX IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140281
Approved by: https://github.com/titaiwangms
2024-11-11 21:01:27 +00:00
001f7366a7 [ROCm] Correct numerical issues in layer norm backwards kernel (#140259)
It was reported that the backward layer norm on AMD was slightly less accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel `cuLoadWriteStridedInputs`, which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (https://github.com/pytorch/pytorch/pull/87635) we truncated `mean` and `rstd` from the T_ACC type to T, which causes numerical issues in the warp buffers created in this kernel. This PR uses the correct accumulator type for mean and rstd.

Note: only AMD calls into this call stack for backward layer norm, so this was not an issue for NVIDIA.
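
Below is a rough, Python-level illustration of why the accumulator type matters; it is not the kernel code, just a sketch of the truncation error the old path introduced for half-precision inputs.

```python
import torch

# Sketch only: storing the per-row mean in the input dtype (T, here float16)
# instead of the accumulator dtype (T_ACC, float32) injects a truncation error
# that is then fed into every gradient term of the backward pass.
x = torch.randn(4096, dtype=torch.float32)
mean_acc = x.mean()                        # accumulator-precision value (T_ACC)
mean_trunc = mean_acc.to(torch.float16)    # what the old kernel effectively kept (T)
print((mean_acc - mean_trunc.float()).abs())
```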

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140259
Approved by: https://github.com/jianyuh
2024-11-11 20:44:18 +00:00
10e40dd5ca [aoti][tooling] Add support to debug printing for all AOTI model run input args (#140064)
Summary:
Add debug printing around: `void AOTInductorModel::run_impl()`

Example:
```
void AOTInductorModel::run_impl(
    AtenTensorHandle*
        input_handles, // array of input AtenTensorHandle; handles
                        // are stolen; the array itself is borrowed
    AtenTensorHandle*
        output_handles, // array for writing output AtenTensorHandle; handles
                        // will be stolen by the caller; the array itself is
                        // borrowed
    DeviceStreamType stream,
    AOTIProxyExecutorHandle proxy_executor
) {

    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 3);
    auto arg0_1 = std::move(inputs[0]);
    auto arg1_1 = std::move(inputs[1]);
    auto arg2_1 = std::move(inputs[2]);
    aoti_torch_print_tensor_handle(arg0_1, "aoti_model_inputs - arg0_1");
    aoti_torch_print_tensor_handle(arg1_1, "aoti_model_inputs - arg1_1");
    aoti_torch_print_tensor_handle(arg2_1, "aoti_model_inputs - arg2_1");
```

Differential Revision: D65616590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140064
Approved by: https://github.com/chenyang78
2024-11-11 20:10:35 +00:00
7f1e248b50 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [1/N] (#139706)
``torch._dynamo.optimize()`` is wrapped for convenience by ``torch.compile()``.
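
For reference, a minimal sketch of the replacement being made throughout this series (using the `"eager"` backend just so the snippet runs anywhere):

```python
import torch

def fn(x):
    return torch.sin(x) + torch.cos(x)

# Older style being replaced: torch._dynamo.optimize(backend) returns a decorator.
compiled_old = torch._dynamo.optimize("eager")(fn)

# Preferred style: torch.compile() wraps the same machinery.
compiled_new = torch.compile(fn, backend="eager")

x = torch.randn(8)
assert torch.allclose(compiled_old(x), compiled_new(x))
```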

related commits:

- #139706
- #140238
- #140247
- #140253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139706
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-11-11 20:04:08 +00:00
e7ec294c10 NJT OpInfo tests v2 (#138370)
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
    * The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
    * I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
        * Test perf - adding `SampleInput`s is faster than generating entire new tests
        * Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed

```python
# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"
    # function to indicate whether the rule applies; return True if so
    match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

    def match(self, device, dtype, op, sample) -> bool:
        return self.match_fn(device, dtype, op, sample)
```

Example:
```python
    # Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
    # tensor with 1 in ragged dim.
    XFailRule(
        error_type=RuntimeError,
        error_msg="cannot call binary pointwise function .* with inputs of shapes",
        match_fn=lambda device, dtype, op, sample: (
            isinstance(op, BinaryUfuncInfo)
            and "noncontig_holes" in sample.name
            and "broadcasting 1 over ragged" in sample.name
        ),
        name="binary_noncontig_holes_broadcasting_1_over_ragged",
    ),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
2024-11-11 19:35:24 +00:00
0a0915fb5e [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 18:49:22 +00:00
96b64182de Delete Buck1 as it is no longer supported (#140067)
Buck1 is no longer supported in favor of buck2. This CI tests the old buck1 flow; however, it is difficult to maintain, especially since buck1 doesn't support aarch64 Mac.

I am suggesting that this CI be deprecated until a decision on buck2 is made, and buck2 support is added. As of now, there seems to be no push towards adding buck2 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140067
Approved by: https://github.com/huydhn
2024-11-11 18:49:18 +00:00
5f4a21dc58 Revert "[SymmetricMemory] improve the API for stream_write_value32 (#139934)"
This reverts commit 2f3a5a15ef701ffab9a880cf822ff8e5224a4b33.

Reverted https://github.com/pytorch/pytorch/pull/139934 on behalf of https://github.com/malfet due to Broke distributed tests, see https://github.com/pytorch/pytorch/actions/runs/11770673088/job/32784210441 ([comment](https://github.com/pytorch/pytorch/pull/139934#issuecomment-2468641512))
2024-11-11 17:02:07 +00:00
2fe110ff3a [BE][MPS] Standardize indexing shader compilation (#140271)
It was wrong to add it to MPSDevice in the first place, as in the end it's just a regular shader, like all others.
I.e. this PR:
 - Moves contents of `at::mps::indexing_metal_shaders` into `kernels/Indexing.metal`
 - Deletes `MPSDevice::getMetalIndexingLibrary()` and `MPSDevice::metalIndexingPSO` methods
 - Moves `at::native::mps::generateKernelDataOffsets` implementation from `OperationUtils.mm` to `Indexing.mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140271
Approved by: https://github.com/Skylion007
2024-11-11 17:00:49 +00:00
f5ffd55a32 [MPS] Add torch.special.i1 op (#140196)
By more-or-less copy-n-pasting 58b661cda2/aten/src/ATen/native/cuda/Math.cuh (L576)

Enable respective tests in test_mps.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140196
Approved by: https://github.com/Skylion007
2024-11-11 16:57:53 +00:00
63715f6567 S390x update builder image (#132983)
Publish current state of s390x builder image to allow reproducing worker setup.
Also, if this image gets published to a docker repository later, it would be possible to download the published image instead of building it into the worker image in https://github.com/pytorch/pytorch/blob/main/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile#L66, which should improve restart time at the cost of additional runtime overhead.

Compared to first attempt to merge:
- default docker repository settings are added to all runners. Changes are mirrored in this PR.
- the job is moved into a separate workflow file.
- updating limits on s390x is no longer attempted. Limits should be properly set up on the host, and it's not possible to update them from the worker since it runs in a container. Also, the worker container currently doesn't have sudo installed or configured, nor any systemd running.
- the github token is now passed once via a named pipe instead of an environment variable. This should increase token security.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-11-11 16:14:06 +00:00
04b5b4a94e Add base class for single-subgraph inductor HOPs (#139898)
This PR adds "PrimHOPBase", which is intended to be a base class that
one can extend to create new HOPs that match some criteria:
- they take one subgraph as input, and their semantics are running the
  subgraph on some operands
- the HOP stays alive until Inductor

The motivation is that we are seeing a lot more HOPs (invoke_subgraph,
invoke_quant) that have this property and there can be a lot of shared
code between them.

Future:
- Migrate invoke_subgraph to use this
- There are some TODOs in the code

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139898
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
2024-11-11 16:12:35 +00:00
d4b8857e51 [codecache][triton 3.2] hash -> base64 conversion for triton 3.2 (#140190)
In old triton versions, you take the hash of the triton kernel and use it in the filepath for the cached kernel. In Triton 3.2 (after https://github.com/triton-lang/triton/pull/4553), the filepath will use the base-64-encoded representation of the hash in the path.

This PR checks whether the `_base64` function exists in triton, and if so, uses the base-64-encoded representation in the path.
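
A minimal sketch of the version-dependent path logic described above; the import location of `_base64` is an assumption here, since the PR only probes for the function's existence:

```python
import hashlib

def kernel_cache_key(source: str) -> str:
    kernel_hash = hashlib.sha256(source.encode()).hexdigest()
    try:
        from triton.runtime.cache import _base64  # assumed location; may differ
    except ImportError:
        return kernel_hash           # older Triton: raw hash appears in the file path
    return _base64(kernel_hash)      # Triton 3.2+: base64-encoded hash in the path
```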

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140190
Approved by: https://github.com/ezyang
2024-11-11 15:32:28 +00:00
ceb44b22dc [FR] Enable best effort parital analysis and verbose mode for trace printing (#139853)
Based on user feedback, we want to enable two things for FR analysis script:
1. Print out more information when verbose is specified.
2. Perform best effort based analysis when not all ranks have FR trace dumped.

Differential Revision: [D65516081](https://our.internmc.facebook.com/intern/diff/D65516081/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139853
Approved by: https://github.com/c-p-i-o
2024-11-11 14:38:32 +00:00
cb15c15157 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics.

Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens.

One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
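
A self-contained sketch of the pattern described above (names and fields are simplified stand-ins, not the actual dynamo internals):

```python
import contextlib
import time

class MetricsContext:
    def __init__(self, on_exit):
        self._metrics, self._depth, self._on_exit = {}, 0, on_exit

    def __enter__(self):
        if self._depth == 0:
            self._metrics = {}          # fresh record for the outermost entry
        self._depth += 1
        return self

    def __exit__(self, *exc):
        self._depth -= 1
        if self._depth == 0:            # only the outermost exit logs
            self._on_exit(self._metrics)

    def increment(self, field, value):
        assert self._depth > 0, "increment() called outside of MetricsContext"
        self._metrics[field] = self._metrics.get(field, 0) + value

METRICS = MetricsContext(on_exit=lambda m: print("dynamo_compile:", m))

@contextlib.contextmanager
def dynamo_timed(field):
    start = time.time()
    try:
        yield
    finally:
        METRICS.increment(field, time.time() - start)

with METRICS:                           # demarcates one CompilationMetrics entry
    with dynamo_timed("inductor_compile_time_s"):
        time.sleep(0.01)
    with METRICS:                       # recursive re-entry does not log twice
        METRICS.increment("graph_breaks", 1)
```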

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
2024-11-11 14:24:23 +00:00
565a7942ee Recover non-standard bool test for msort (#139870)
Summary:
I was looking into why non-standard bool values fail for msort. It makes sense for argsort and sort to fail, because we're randomly generating uint8 values, so the order will be different (and thus the indices will be different). But msort should work.

After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison treats it as signed. I tried lhs=255 and rhs=0: lhs < rhs is equivalent to -1 < 0, which is true (but it's supposed to be False).

Therefore we add an explicit type cast.
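
A small illustration of what a "non-standard" bool looks like from Python (viewing uint8 storage as bool is just one way to get such values; the actual test uses random uint8 data):

```python
import torch

raw = torch.tensor([255, 0, 1], dtype=torch.uint8)
b = raw.view(torch.bool)   # byte 255 is "true" but is not the canonical value 1

# With the signed comparison described above, 255 compares as -1 and could be
# ordered before 0; with the explicit cast, False (0) sorts first as expected.
print(torch.msort(b))
```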

Test Plan: Remove the test skip

Differential Revision: D65472170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-11-11 02:00:34 +00:00
2f3a5a15ef [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 01:54:35 +00:00
cyy
ffb979032d [7/N] Fix Wextra-semi warning (#140225)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140225
Approved by: https://github.com/ezyang
2024-11-10 14:28:10 +00:00
d90c25e3e2 OpenReg: Support event (#140111)
Support events. Since the CPU backend doesn't support asynchronous execution, all event operations are executed immediately on the executor side.
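
A minimal sketch of what "executed immediately" means for a synchronous backend (illustrative only, not the OpenReg implementation):

```python
import time

class SyncEvent:
    def __init__(self):
        self._t = None

    def record(self):
        self._t = time.perf_counter()   # nothing to enqueue; timestamp right away

    def synchronize(self):
        pass                            # work is already complete on a synchronous backend

    def query(self):
        return True                     # always "ready"

    def elapsed_time(self, end):
        return (end._t - self._t) * 1000.0  # milliseconds, mirroring torch events

start, end = SyncEvent(), SyncEvent()
start.record()
end.record()
print(start.elapsed_time(end), "ms")
```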

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140111
Approved by: https://github.com/ezyang
2024-11-10 08:38:45 +00:00
c3087ace58 Update torch-xpu-ops commit pin (#139986)
Update the torch-xpu-ops commit to [5e29831 ](https://github.com/intel/torch-xpu-ops/commit/5e29831). Includes:
- OneAPI-2025 build issue fix
- Enhancement of the XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139986
Approved by: https://github.com/guangyey, https://github.com/jansel
2024-11-10 06:49:38 +00:00
94c9bb73c0 [Inductor] [CPP] Update BRGEMM parameters for Half cpp gemm template (#140116)
Update BRGEMM parameters for Half cpp gemm template as BRGEMM api is changed https://github.com/pytorch/pytorch/pull/138184.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140116
Approved by: https://github.com/jansel
2024-11-10 06:37:10 +00:00
4f6b30bcbc Add testing for the utils surrounding dynamo_timed (#140094)
Summary: This will make it easier to verify that we don't break these utilities for the refactor in https://github.com/pytorch/pytorch/pull/139849.
It's one giant test. I can split it into multiple tests for better readability if people prefer that. My rationale for the giant test is that I found I was just resetting compilation and recompiling the same thing many times, which was slow and wasteful.

Test Plan: The new tests

Differential Revision: [D65682138](https://our.internmc.facebook.com/intern/diff/D65682138)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140094
Approved by: https://github.com/ezyang
2024-11-10 04:17:45 +00:00
5ef33e40b3 Add size param check of unfold (#139965)
Fixes #76617

Changes:

- Add a check of the input `size` value and give a user-friendly hint message
- Fix `FIXME: move to shape ops test suite` in the test file

Before
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: Storage size calculation overflowed with sizes=[9, -1] and strides=[1, 1]

```

After
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: size is -1 but must be >= 0
```

Test Result:
```bash
pytest test/test_shape_ops.py
```

![image](https://github.com/user-attachments/assets/d7bcef62-04e6-4187-9c8f-bc5220ff6c33)

```bash
$ lintrunner
```

![image](https://github.com/user-attachments/assets/6b48d095-5c8a-4e75-9957-dc22d39a73bb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139965
Approved by: https://github.com/ezyang
2024-11-09 17:12:53 +00:00
f89b2b9630 Refactor conda-builder -> almalinux-builder (#140157)
This changes the conda-builder workflow to almalinux-builder and switches Docker file to almalinux.
Please note: Published conda-builder images will still be available, hence workflows that use these images will still work.
We will be switching workflows that use conda-builder images to almalinux-builder

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140157
Approved by: https://github.com/malfet
2024-11-09 16:06:40 +00:00
cyy
7d4f5f7508 [Environment Variable][6/N] Use thread-safe getenv functions (#140200)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/ezyang
2024-11-09 15:05:51 +00:00
a2ac96cae0 [BE] Rectify some references to caffe2 (#140204)
- Rename `tools.build_pytorch_libs.build_caffe2` to `tools.build_pytorch_libs.build_pytorch`
- Delete number of `if BUILD_CAFFE2` conditions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140204
Approved by: https://github.com/huydhn, https://github.com/r-barnes, https://github.com/atalman
2024-11-09 14:14:20 +00:00
5107d244ee [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We no longer want to log args and kwargs directly because if they contain a tensor or tensor subclass, converting them to a string can take a long time or may not even be supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337, https://github.com/kwen2501
2024-11-09 13:57:32 +00:00
052b67e2b4 Add torch.version.xpu (#139466)
# Motivation
We add a new attribute `torch.version.xpu` to facilitate problem diagnosis and version control.

# Additional Context
It is aligned with `torch.version.cuda` and `torch.version.hip`.
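
A quick check of the new attribute; the assumption here is that it is `None` on builds without XPU support, mirroring `torch.version.cuda` and `torch.version.hip`:

```python
import torch

print(torch.version.cuda)  # e.g. "12.4" on a CUDA build, None otherwise
print(torch.version.hip)   # ROCm version string or None
print(torch.version.xpu)   # the new attribute; assumed None without XPU support
```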

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139466
Approved by: https://github.com/EikanWang, https://github.com/ezyang, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #139258
2024-11-09 13:31:21 +00:00
8051ee802c Add XPU compiler version control in cmake to keep BC (#139258)
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.

# Additional Context
The details are described below. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the sycl library is named `sycl7.lib` in the old compiler but `sycl.lib` in the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` with the old compiler, but we can link `libsycl.so` with the new compiler to get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code keeps good backward compatibility with the old compiler. The new features (Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler are now gated behind the `SYCL_COMPILER_VERSION` macro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
2024-11-09 13:31:21 +00:00
191971e01d [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742)
[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU.

### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For the same backend, some out-of-tree ops are not registered in that file but are registered externally, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In this case, the existing codegen cannot extend the c shims it has already produced for the in-tree ops with these out-of-tree ops.

### Design
To extend the c shim with more out-of-tree ops for a backend, the PR provides a bool option `--aoti-extend` to indicate that the codegen should extend the c shim with out-of-tree ops.
The generated c shim is stored in the `extend` subdirectory, for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim  `
`--xpu`:  generate c shim for XPU
`--aoti-extend`: extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`) with out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`)
`--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
2024-11-09 13:19:52 +00:00
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.

Motivation: There are two parts of aten ops for XPU: one is in-tree ops like GEMM-related ops, and the other is out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
0b650c360a Build magma for windows (#139924)
Copy the magma for windows job and script from pytorch/builder c9aac65e12/.github/workflows/build-magma-windows.yml

The linux version is moved here in https://github.com/pytorch/pytorch/pull/139888

Fixes #140001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139924
Approved by: https://github.com/atalman
2024-11-09 09:27:59 +00:00
e2e425b4f3 [CUDAGraph] Add dynamo timer to checkpoint, warmup, and record (#139818)
Summary: Add timing logs to cudagraph, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`.

Test Plan:
1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph

2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9).


Differential Revision: D65505659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818
Approved by: https://github.com/eellison
2024-11-09 05:27:11 +00:00
cyy
ab55a99283 Use TORCH_DECLARE_XXX (#139952)
Because those files use TORCH_API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139952
Approved by: https://github.com/ezyang
2024-11-09 04:56:28 +00:00
d2d1258b1b Speed up AMD AOT Inductor lowering by memoizing hipify trie to regex logic (#140156)
Summary:
AMD lowering duration is 1.55x longer than on H100. Profiling shows that hipification-related functions take 22% of the overall lowering time.

This diff cuts that time by safely memoizing the trie-to-regex logic. The trick is to incrementally build a state of the trie during trie construction. The state is the hash of all the words added to the trie.
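
An illustrative sketch of the memoization idea (the class layout and method names are assumptions, not the actual hipify `Trie`):

```python
import hashlib
import re

class Trie:
    _regex_cache = {}                      # shared cache keyed by the word-set hash

    def __init__(self):
        self._words = []
        self._state = hashlib.sha256()     # updated incrementally as words are added

    def add(self, word: str) -> None:
        self._words.append(word)
        self._state.update(word.encode())  # the "state" is the hash of all added words

    def export_to_regex(self) -> "re.Pattern[str]":
        key = self._state.hexdigest()
        if key not in self._regex_cache:
            # Building the alternation is the expensive step we avoid repeating.
            pattern = "|".join(re.escape(w) for w in sorted(self._words, key=len, reverse=True))
            self._regex_cache[key] = re.compile(pattern)
        return self._regex_cache[key]

t = Trie()
for w in ("cudaMalloc", "cudaFree", "cublasHandle_t"):
    t.add(w)
print(t.export_to_regex().sub("<hip>", "cudaMalloc(&p, n); cudaFree(p);"))
```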

Differential Revision: D65659445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler

Co-authored-by: Kefei Lu <kefeilu@meta.com>
2024-11-09 04:28:58 +00:00
8b2e3855a9 Make size a property with an assertion (#139794)
Fixes https://github.com/pytorch/pytorch/issues/120568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139794
Approved by: https://github.com/williamwen42
2024-11-09 03:39:41 +00:00
cyy
032135f8a2 [2/N] Turn inline static functions into static (#140068)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140068
Approved by: https://github.com/ezyang
2024-11-09 03:31:24 +00:00
3b8470c461 add special case for __round__ constant variables (#139583)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCUDA.test_cauchy_cuda_float64` when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139583
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935, #139587
2024-11-09 03:25:53 +00:00
f915409c26 FlopCounterMode: Decompose ops for inference mode (#138508)
Fixes #126268

I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.

Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
2024-11-09 03:13:53 +00:00
4488e23763 Fix another item memo loss location + bool specialization bug (#139587)
This fix was a bit more involved:
1) It fixes an item_memo loss location.
2) It updates a test to be eager instead of aot_eager, since it reveals a very obscure bug related to replacements that's not worth solving given that in practice inductor will regenerate the runtime asserts anyway.
3) It updates tensorify to specialize more places now that the aforementioned bug is fixed.

Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`

while ensuring `python test/dynamo/test_dynamic_shapes.py DynamicShapesMiscTests.test_runtime_assert_replacement_dynamic_shapes` doesn't regress

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139587
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935
2024-11-09 03:11:19 +00:00
4893e248a8 [DTensor][Test] Remove safe global context for weights_only torch.load() DTensor (#140173)
We have added DTensor-related classes to the allowed globals so we can torch.load(DTensor) with weights_only=True, so we no longer need the safe_globals context for this test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140173
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #139949
2024-11-09 02:21:44 +00:00
72976b2486 Use manylinux-builder images with main tag (#140158)
The magma build uses deprecated manylinux-builder images. Update it to use the images with "main" in the tag:

  pytorch/manylinux-builder:cuda<version>-main

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140158
Approved by: https://github.com/atalman
2024-11-09 02:16:00 +00:00
2ede4c9a38 [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating over assignment, whose size is the same as the number of nodes in the graph. But we can reach the same results by iterating over partitions_by_id, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-09 01:31:46 +00:00
9c678af9f9 Misc. non-contig NJT fixes (#140160)
This PR contains several fixes related to non-contiguous NJTs:
1. Propagates `lengths` through op calls appropriately (see desc of #138098)
    * SDPA now calls `nested_view_from_values_offsets_lengths()` instead of `nested_view_from_values_offsets()`
2. Allows non-contig NJTs in unsqueeze / transpose / select
3. Expands padded dense -> NJT conversion to support non-contig NJTs
4. (unrelated sorry) Updates `split` / `split_with_sizes` to allow for optional `dim`, matching the ATen signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140160
Approved by: https://github.com/cpuhrsch
2024-11-09 01:18:26 +00:00
be172d2a60 [pt2, docs] Add new PT2 troubleshooting doc (#138620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138620
Approved by: https://github.com/ezyang

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-11-09 01:17:39 +00:00
de40a23f6c [dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.

Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars) # ('__class__',)
print(Foo.__init__.__closure__)          # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #140033
2024-11-09 01:03:24 +00:00
0b8652a999 [dynamo] Remove NestedUserFunctionVariable.closure_scope (#140033)
This was no longer needed after https://github.com/pytorch/torchdynamo/commit/663e4d92,
which removed the uses of `closure_scope` but not the field itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140033
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
2024-11-09 01:03:24 +00:00
cyy
263d8f7a94 [8/N] Don't skip ASAN on some tests (#140081)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140081
Approved by: https://github.com/ezyang
2024-11-09 01:00:13 +00:00
58b661cda2 Revert "[c10d][Logging] Remove args and kwargs from c10d logging (#140169)"
This reverts commit e3b2f04f052fbc5dcf728f33ac59917d087c324c.

Reverted https://github.com/pytorch/pytorch/pull/140169 on behalf of https://github.com/ZainRizvi due to Man, this test really wants to fail on trunk. Sorry. Details:  distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11751023962/job/32740983427) [HUD commit link](e3b2f04f05) ([comment](https://github.com/pytorch/pytorch/pull/140169#issuecomment-2465933413))
2024-11-09 00:23:43 +00:00
090b778b8a Clarify meaning of rate parameter in Gamma distribution (#134847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134847
Approved by: https://github.com/fritzo
2024-11-09 00:22:13 +00:00
7eb66173e2 Revert "Fix split decomp returning self (#140065)"
This reverts commit 9d99dceb53884387665a2c273beca99a157193a5.

Reverted https://github.com/pytorch/pytorch/pull/140065 on behalf of https://github.com/ZainRizvi due to Diff been imported internally, but merged externally. And the internal diff has been updated so the diff and PR are now mismatched.  Reverting this PR to get things back into a consistent state. See D65635070 ([comment](https://github.com/pytorch/pytorch/pull/140065#issuecomment-2465928027))
2024-11-09 00:16:26 +00:00
a02e88d19c [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041)
Summary:
Bump miniz version from 2.1.0 to 3.0.2 and apply these patches:

* #79636 patches internal BUCK and bazel build
* #138959 adds `bool compute_crc32` argument
* miniz PR: https://github.com/richgel999/miniz/pull/324 to support
  zip64

Anyone bumping miniz version again, please apply these patches as well.

Test Plan:
Rely on unit test

Imported from OSS

Differential Revision: D65586230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140041
Approved by: https://github.com/mikaylagawarecki
2024-11-09 00:13:16 +00:00
1400fedf76 Revert "add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)"
This reverts commit e5574445b01f264e57653a8a42af1118e89acc9a.

Reverted https://github.com/pytorch/pytorch/pull/135338 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. Please see D65663382 for more details ([comment](https://github.com/pytorch/pytorch/pull/135338#issuecomment-2465911854))
2024-11-08 23:52:49 +00:00
ea0f60ecfa [Dynamo] allow dynamic callables on tensor variables (#137940)
Fixes https://github.com/pytorch/pytorch/issues/134844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137940
Approved by: https://github.com/williamwen42
2024-11-08 23:49:34 +00:00
beae7725be Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit d3788190685685cb828bdf6bed90270c0b60affc.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D65641103 for more details ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2465906839))
2024-11-08 23:44:41 +00:00
2af5172774 fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
As we upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in dynamo tracking.
This PR changes the filtering rules to include them while keeping the behavior with NumPy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` that considered supported for NumPy 2.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/williamwen42
2024-11-08 23:38:53 +00:00
1659e241c8 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Differential Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-08 23:28:25 +00:00
e3b2f04f05 [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We no longer want to log args and kwargs directly because if they contain a tensor or tensor subclass, converting them to a string can take a long time or may not even be supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337
2024-11-08 23:24:52 +00:00
cc44b55b00 Hook up bf16_gemv_trans to x86 bf16 GEMM (#139220)
This is the big milestone for bf16 and should enable us to close https://github.com/pytorch/torchchat/issues/1253.

Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2.

Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139220
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081, #139208
2024-11-08 23:24:36 +00:00
25c469bac3 Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208)
Very similar to #137917, but for bf16.

Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139208
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081
2024-11-08 23:24:36 +00:00
7f0bf9f961 Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel (#139081)
Following the previous move of fp16_gemv_trans.

Testing: Checked for performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm`, didn't find one
Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139081
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558
2024-11-08 23:24:29 +00:00
44f6d1439e Unbreak vec128_half_neon comparison without FP16 hardware support (#139558)
Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.

Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass.

Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139558
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090
2024-11-08 23:24:22 +00:00
ac6b6c6f98 [BE][CI] Use pip3 instead of pip (#140185)
On modern distros (see this oldie but goodie: https://launchpad.net/ubuntu/focal/+package/python-is-python3), the `pip` alias might be missing or might indeed point to a Python 2 installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140185
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere
2024-11-08 23:15:02 +00:00
1cdaf1d85f correctly keep track of processed tensors for foreach reductions (#140103)
Fixes #140066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103
Approved by: https://github.com/janeyx99

Co-authored-by: Jane Xu <janeyx@meta.com>
2024-11-08 23:04:53 +00:00
f3cbf67686 [CD] Build aarch64 wheels without conda (#140093)
The manylinuxaarch64-builder image already comes pre-built with all versions of the Python runtime.

Refactor the logic for setting the path to DESIRED_PYTHON from `manywheel/build_common` into `set_desired_python.sh` and call it from aarch64_ci_setup.sh.

In follow-up PRs, move the scons and ninja installation into the base docker image.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140093
Approved by: https://github.com/atalman
2024-11-08 22:24:28 +00:00
95198f8299 Remove uses of deleted operations (#139447)
resolves: https://github.com/pytorch/pytorch/issues/138721

Summary:

Delete the uses of deleted nodes. The double for-loop is icky here, but N should
be pretty small and removing it requires refactoring the datastructures
involved, which is a bigger endeavor.

Test Plan:

Normal test coverage should be sufficient. There were a couple of spots in the
scheduler code that didn't check users being deleted, so I'll run a perf test to see
what impact that has, and to make sure N^2 doesn't affect compile times.

Perf:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2029%20Oct%202024%2017%3A41%3A36%20GMT&stopTime=Tue%2C%2005%20Nov%202024%2018%3A41%3A36%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=exclamaforte/prune-deleted-users&lCommit=5cb1aa6f7d8a52acdae0c7cf36b8c2d536d7f0d1&rBranch=main&rCommit=f4ee5a243dbb31e6310e5632b1c87898b299df2c
off of nov4 nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139447
Approved by: https://github.com/eellison
2024-11-08 22:21:53 +00:00
347f96061f Revert "[cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)"
This reverts commit cf0bb6c435c58db4c72e489f462e1a0ebe310f14.

Reverted https://github.com/pytorch/pytorch/pull/136827 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. See D65605094 for more details ([comment](https://github.com/pytorch/pytorch/pull/136827#issuecomment-2465805271))
2024-11-08 21:52:33 +00:00
a7724518c0 Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)"
This reverts commit d72a308e77ec8895d48798dda05996cbc44ffa3e.

Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/ZainRizvi due to Sorry but the newly added tests in test_mkldnn_pattern_matcher.py fail internally. See D65661038 for more details ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2465797016))
2024-11-08 21:45:52 +00:00
80d0356b11 Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit c03324de2dfbbf0006818c86b88c92a3378f46b7.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/ZainRizvi due to This fails to build internally. See D65604944 for more details ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2465790157))
2024-11-08 21:40:10 +00:00
3483f7809e Revert "Fix typo in associative_scan tests (#139929)"
This reverts commit 7fa94f03635709a30ef85c6955dcdd5051e72e71.

Reverted https://github.com/pytorch/pytorch/pull/139929 on behalf of https://github.com/ZainRizvi due to This test is breaking in trunk somehow, which is really weird. functorch/test_control_flow.py::AssociativeScanTests::test_associative_scan_binary_operator_compile_mode_compile_dynamic_shape_combine_mode_pointwise_reverse_False_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11747748990/job/32732254909) [HUD commit link](7fa94f0363) ([comment](https://github.com/pytorch/pytorch/pull/139929#issuecomment-2465773366))
2024-11-08 21:26:41 +00:00
411203e7c1 Revert D65490202 (#140142)
Summary:
This diff reverts D65490202
This is causing tests to fail on open source. See distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11736922614/job/32697709457) [HUD commit link](ba9645f6e5)

Test Plan: NA

Differential Revision: D65663063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140142
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-11-08 21:22:32 +00:00
119e0699cc [ez] Add .lintrunner.private.toml to .gitignore (#140166)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140166
Approved by: https://github.com/Skylion007
2024-11-08 20:55:21 +00:00
63a0d6587e [AOTI] Update the OSS tutorial (#139956)
Summary: Update the OSS tutorial to use the new aoti_compile_and_package and aoti_load_package APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139956
Approved by: https://github.com/angelayi
ghstack dependencies: #139955
2024-11-08 20:46:57 +00:00
07ad74635b Revert "[Reland] Use static_assert to detect get_type_index used in device code (#139966)"
This reverts commit ca7fdfe4d25f91c4cae48fde6eeac990738447f2.

Reverted https://github.com/pytorch/pytorch/pull/139966 on behalf of https://github.com/malfet due to This approach will prevent one from using get_type_index from device code ([comment](https://github.com/pytorch/pytorch/pull/139966#issuecomment-2465701260))
2024-11-08 20:32:43 +00:00
e6c5a77485 [dynamo][guards] Profile guard manager in C++ (#140110)
This should remove the pybind noise from the profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140110
Approved by: https://github.com/jansel
ghstack dependencies: #139953
2024-11-08 18:44:08 +00:00
a140e65e0f [dynamo] Support method with different __self__ on user defined objects (#139953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139953
Approved by: https://github.com/jansel
2024-11-08 18:44:08 +00:00
d18bca4961 [dynamo] switch to get_framelocals_mapping for 3.10 and below (#140037)
Part of implementing https://github.com/pytorch/pytorch/issues/93753. Next step will be to use a lower overhead data structure over `py::dict`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140037
Approved by: https://github.com/jansel
ghstack dependencies: #139921, #139950
2024-11-08 18:43:54 +00:00
bbd427faf5 [dynamo] switch to get_framelocals_mapping for 3.11 (#139950)
Part of implementing https://github.com/pytorch/pytorch/issues/93753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139950
Approved by: https://github.com/jansel
ghstack dependencies: #139921
2024-11-08 18:43:54 +00:00
7fa94f0363 Fix typo in associative_scan tests (#139929)
Fix typo with Associative_Scan tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139929
Approved by: https://github.com/ydwu4
2024-11-08 18:42:26 +00:00
dfcf740a61 Fix traceback.format_exception(...) positional arguments error. (#140109)
Fix #140095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140109
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
2024-11-08 18:22:32 +00:00
8d61add14a Add Vectorized<c10::BFloat16> specialization for ARM (#139090)
When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds, feedback especially welcome there.

Testing: vec_test_all_types should cover correctness. For perf, seems clear that using vectorized intrinsics should be better than vec_base?

Differential Revision: [D64997747](https://our.internmc.facebook.com/intern/diff/D64997747/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139090
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139084
2024-11-08 17:11:40 +00:00
8690f60f39 Extract value_type-generic NEON Vectorized<Half> functions to CRTP base class (#139084)
This is in preparation for adding NEON Vectorized<BFloat16>, which will be simplified by sharing this code.

Differential Revision: [D64997744](https://our.internmc.facebook.com/intern/diff/D64997744/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139084
Approved by: https://github.com/malfet
2024-11-08 17:11:40 +00:00
1868fc63d8 [AOTI] Update C++ runner API to take a const vector (#139955)
Summary: Tighten the AOTIModelContainerRunner::run interface to take a const vector of at::Tensor, which 1) makes it clear that the runner will not modify the input tensor vector; 2) allows the runner to take a temporary vector of tensors as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139955
Approved by: https://github.com/chenyang78
2024-11-08 16:59:10 +00:00
fc6496c703 Revert "Enable inductor-rocm workflow for all trunk commits AND inductor-related PRs (#138623)"
This reverts commit ee7c3db092e09cde37ee33648dff1955bcd71e82.

Reverted https://github.com/pytorch/pytorch/pull/138623 on behalf of https://github.com/huydhn due to I think the link failure is legit, it complains about the wrong concurrency setting in the workflow ([comment](https://github.com/pytorch/pytorch/pull/138623#issuecomment-2465277228))
2024-11-08 16:58:05 +00:00
9d99dceb53 Fix split decomp returning self (#140065)
Previously the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:

> RuntimeError: View operation returned a tensor that is the same as the input base tensor.  This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.

Fix for https://github.com/pytorch/pytorch/issues/133394

Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
2024-11-08 16:53:18 +00:00
22cd1ee951 [CD] Enable 3.13 triton build (#140137)
Copied from https://github.com/pytorch/pytorch/pull/139652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140137
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-11-08 16:34:10 +00:00
dd79d2f5e7 Removing warning for Windows Arm64 (#139746)
This PR removes the warning message on Windows on Arm64, which was triggered by an issue in one of the DLLs, to improve the user experience.

`Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.
                 It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe`

The issue is being tracked here: https://developercommunity.visualstudio.com/t/VCRUNTIME140_1DLL-Miscompiled-for-Arm64/10781635?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139746
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-08 16:23:59 +00:00
1d2d9f0de8 Give the magma build job id-token write permissions (#140141)
The configure-aws-credentials action requires special permissions: https://github.com/aws-actions/configure-aws-credentials?tab=readme-ov-file#oidc

Give "id-token: write" permssion to the job that sets the AWS credentials to upload to the S3 bucket.

Fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140141
Approved by: https://github.com/atalman
2024-11-08 15:59:49 +00:00
ee7c3db092 Enable inductor-rocm workflow for all trunk commits AND inductor-related PRs (#138623)
It should help with triaging ROCm-inductor-related breakages and surfacing them in the PRs itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138623
Approved by: https://github.com/huydhn
2024-11-08 15:54:09 +00:00
7167323644 Fix type description of torch.chunk (#140089)
Fixes #126278

- Change return type description of `torch.chunk` to tuple (see the example below)
- Add type for input parameters
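
For reference, a quick check of the documented behavior (a small illustrative example, not taken from the PR):

```python
import torch

x = torch.arange(6)
chunks = torch.chunk(x, 3)
print(type(chunks))  # <class 'tuple'>
print(chunks)        # (tensor([0, 1]), tensor([2, 3]), tensor([4, 5]))
```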

**Before**
![image](https://github.com/user-attachments/assets/087b6cfa-0815-443b-a69a-785ca4b421d7)

**After**
![image](https://github.com/user-attachments/assets/19532553-6004-4246-a6cf-f7f685f5775c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140089
Approved by: https://github.com/awgu
2024-11-08 15:21:13 +00:00
838958de94 [inductor] Support autotune restore_value for user-defined Triton kernels (#139851)
This PR adds support for the `restore_value` argument of the
`@triton.autotune` decorator for user-defined Triton kernels in PT2.

The `kernel.restore_idx` values are extracted in the
`ir.UserDefinedTritonKernel` and the corresponding arg names are
placed into the `triton_meta["restore_value"]`. From there, those
are added to the existing `mutated_arg_names` in the caching autotuner
infra, which leads to the listed args being cloned.
This achieves the equivalent effect to the native `restore_value`.
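
To illustrate the feature, a minimal sketch of a user-defined kernel that mutates its input in place and therefore relies on `restore_value` during autotuning (the kernel and all names are illustrative, not taken from this PR; it assumes a Triton version that supports `restore_value`):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 64}, num_warps=4),
        triton.Config({"BLOCK": 128}, num_warps=8),
    ],
    key=["n_elements"],
    restore_value=["x_ptr"],  # x_ptr is mutated in place; restore it between benchmark runs
)
@triton.jit
def scale_inplace_kernel(x_ptr, n_elements, alpha, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x * alpha, mask=mask)
```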

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139851
Approved by: https://github.com/oulgen
2024-11-08 14:59:00 +00:00
a33fa37b4e [ROCm] Support new AMD triton stream pipeliner (#139881)
Fixes #139182

In Triton 3.2, num_stages=0 will be deprecated in Triton's AMD backend. Let's query the default num_stages from the relevant Triton backend instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139881
Approved by: https://github.com/bertmaher
2024-11-08 14:51:05 +00:00
c1c94cb0be Build magma binary tarballs for various cuda (#139888)
This is a first step towards removing builds dependency to conda.

Currently we build magma as a conda package in a pytorch conda channel, implemented in a1b372dbda/magma.

This commit adapts the logic from pytorch/builder as follows:
- use pytorch/manylinux-cuda<cuda-version> as base image
- apply patches and invoke the build.sh script directly (not anymore through conda build)
- stores license and build files along with the built artifact, in an info subfolder
- create a tarball file which resembles that created by conda, without any conda-specific metadata

A new matrix workflow is added, which runs the build for each supported cuda version, and uploads the binaries to the pytorch S3 bucket.

For the upload, define an upload.sh script, which will be used by the magma windows job as well, to upload to `s3://ossci-*` buckets.

The build runs on PR and push; the upload runs in DRY_RUN mode in the case of a PR.

Fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139888
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/seemethere
2024-11-08 13:28:27 +00:00
5f287df422 Add type information for FakeProcessGroup (#133211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133211
Approved by: https://github.com/Skylion007
2024-11-08 11:18:52 +00:00
e5574445b0 add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)
1. My company is using privateuseone to connect a new hardware device and requires the use of the `batch_isend_irecv` function (see the sketch after this list). However, `batch_isend_irecv` is currently only open to CUDA, so I add a `supports_coalescing` property in `c10d::Backend` to determine whether a backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
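
A hedged sketch of the call pattern mentioned in item 1 (it assumes an already-initialized process group on a backend that reports coalescing support; ranks, shapes, and tensors are illustrative):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
send_buf = torch.ones(4)
recv_buf = torch.zeros(4)
ops = [
    dist.P2POp(dist.isend, send_buf, peer=1),
    dist.P2POp(dist.irecv, recv_buf, peer=1),
]
for req in dist.batch_isend_irecv(ops):
    req.wait()
```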

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501
2024-11-08 11:08:45 +00:00
0b7a2d4aef [Windows XPU] Fix MSVC ambiguous symbol error (#138727)
The PT master build with XPU fails due to an MSVC ambiguous symbol error on 'std'. This was previously fixed with an MSVC flag in torch-xpu-ops (https://github.com/intel/torch-xpu-ops/pull/946/files), but the error is observed in PT master too after the 2.5 and oneAPI updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138727
Approved by: https://github.com/guangyey, https://github.com/ezyang
2024-11-08 08:29:36 +00:00
a3052b3b7c Inductor cpp wrapper: clean-up hard-coded schema and related code (#139873)
Fixes https://github.com/pytorch/pytorch/issues/112552.

The non-ABI-compatible mode has been removed, thus the following values are not needed anymore:
`extern_call_ops`
`cpp_op_schema`
`cpp_kernel_key`
`cpp_kernel_overload_name`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139873
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-11-08 08:15:51 +00:00
d9def02050 [Inductor] record time for 'compile time' autotuning (#139431)
Here are the cases in which Inductor does autotuning at compile time:
1. pad mm: benchmark to decide if we should pad or not
2. template autotuning: benchmark triton/cutlass templates and the ATen kernel for matmul/conv and pick the fastest one.

The PR annotates these cases with `dynamo_timed`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139431
Approved by: https://github.com/ezyang
2024-11-08 07:17:00 +00:00
011781f29d Assert that bundled triton payload does not have sentinel value (#139375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139375
Approved by: https://github.com/ezyang
2024-11-08 07:11:40 +00:00
ba9645f6e5 Fix for T206766523 ("Your diff, D65462767, broke some tests") (#139804)
Summary:
This is trying to fix a regression caused by https://github.com/pytorch/pytorch/pull/139757. We no longer want to log args and kwargs directly, because if they contain a tensor or a tensor subclass, converting them to a string can take a long time or may not even be supported.

Reviewed By: fduwjj

Differential Revision: D65490202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139804
Approved by: https://github.com/XilunWu
2024-11-08 05:57:30 +00:00
d72a308e77 [Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of the activation, which is a limitation of the oneDNN library, so in that case we set the activation scale to 1 and the bias to none, and apply the scales and add the bias after `onednn.qlinear_pointwise`.
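
For reference, a plain-eager sketch of the unfused computation described above (names, dtypes, and shapes are illustrative and not taken from Torchao or this PR):

```python
import torch

def smoothquant_int8_linear_ref(a_int8, b_int8, a_scale, b_scale, bias=None):
    # reshape -> _int_mm -> convert_element_type -> mul(b_scale) -> mul(a_scale) [-> add(bias)] -> reshape
    out_shape = a_int8.shape[:-1] + (b_int8.shape[-1],)
    a_2d = a_int8.reshape(-1, a_int8.shape[-1])
    acc = torch._int_mm(a_2d, b_int8)              # int32 accumulation
    y = acc.to(torch.float32) * b_scale * a_scale  # dequantize with both scales
    if bias is not None:
        y = y + bias
    return y.reshape(out_shape)
```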

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-11-08 05:33:16 +00:00
8715fb8aff [DTensor][unpickler] Add DTensor related classes to allowed globals so we can still torch.load(DTensor) with weights_only=True (#139949)
Test uses `torch.load()` for DTensor state_dict:
```
python3 test/distributed/fsdp/test_fsdp_dtensor_state_dict.py -k TestFSDPWithDeviceMeshAndDTensor
```

In this PR, we add `DTensor` related class to allowed safe globals so we can still `torch.load()` a `DTensor` with `weights_only=True`. We also need this for backward compatibility, since `DTensor` can be `torch.load()` before `weights_only` defaults to True. Without the change, `torch.load()` a `DTensor` would run into the following error:
```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL torch.distributed.tensor.DTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([DTensor])` or the `torch.serialization.safe_globals([DTensor])` context manager to allowlist this global if you trust this class/function.
```
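
For completeness, a minimal sketch of the workaround named in the error message above, for builds where `DTensor` is not yet allow-listed by default (the checkpoint path is hypothetical and the file must come from a trusted source):

```python
import torch
from torch.distributed.tensor import DTensor

with torch.serialization.safe_globals([DTensor]):
    state_dict = torch.load("fsdp_dtensor_state_dict.pt", weights_only=True)
```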

The unit test failure was not captured by CI when `weights_only` was being rolled out as the default for `torch.load()`. This is due to another issue: the test communication wrapper `with_comms` lets unit tests silently pass without capturing failures, due to a recent change (https://github.com/pytorch/pytorch/pull/138108). This wrapper issue is going to be fixed by a separate PR, https://github.com/pytorch/pytorch/pull/139637.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139949
Approved by: https://github.com/mikaylagawarecki
2024-11-08 05:06:11 +00:00
b042606d91 Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)
Previously we were checking for a last dim with stride == 1. When the size is <= 1, that is also sufficient, because the stride is insignificant. Fix for https://github.com/pytorch/pytorch/issues/138317
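
A hypothetical sketch of the relaxed condition (not the in-tree code): the last dim is acceptable either when it is contiguous or when its size is 0 or 1, in which case the stride carries no information.

```python
import torch

def last_dim_ok(t: torch.Tensor) -> bool:
    return t.size(-1) <= 1 or t.stride(-1) == 1

print(last_dim_ok(torch.randn(4, 8)))      # True: contiguous last dim
print(last_dim_ok(torch.randn(4, 8).t()))  # False: stride != 1 and size > 1
print(last_dim_ok(torch.randn(8, 1)))      # True: size-1 last dim, stride irrelevant
```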

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139787
Approved by: https://github.com/drisspg
2024-11-08 04:54:05 +00:00
114a0bc306 Make PGO work correctly with NJT inputs (#140046)
We were actually triggering a latent bug where nested ints were
uselessly being incorporated into the automatic dynamic state, even
though they were unconditionally ignored afterwards.  Now we munge
them out before putting them in.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65623303](https://our.internmc.facebook.com/intern/diff/D65623303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140046
Approved by: https://github.com/jbschlosser, https://github.com/bdhirsh
ghstack dependencies: #140042
2024-11-08 04:27:39 +00:00
af682f3cd7 Move put_code_state to only trigger on successful compile (#140042)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65623081](https://our.internmc.facebook.com/intern/diff/D65623081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140042
Approved by: https://github.com/markkm
2024-11-08 04:19:50 +00:00
cyy
43f0fe60a3 [Environment Variable][5/N] Use thread-safe getenv functions (#139762)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139762
Approved by: https://github.com/ezyang
2024-11-08 03:49:09 +00:00
86792a5a8d [invoke_subgraph] User facing API to support arbitrary args and kwargs (#139162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139162
Approved by: https://github.com/zou3519
2024-11-08 03:31:19 +00:00
4715b77001 Create manylinux 2.28 cuda 12.6 image (#139909)
Add a version of the manylinux 2.28 image with cuda 12.6.

Once this is done, cuda 12.6 can be enable for the new magma non-conda distribution provided by https://github.com/pytorch/pytorch/pull/139888

Partially-fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139909
Approved by: https://github.com/atalman
2024-11-08 03:03:04 +00:00
1fcc99c6bf Update quantization.rst (#139824)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139824
Approved by: https://github.com/svekars
2024-11-08 02:34:50 +00:00
347d134ee2 [BE] Delete DeprecatedTypeProperties cast (#139358)
Differential Revision: [D65549001](https://our.internmc.facebook.com/intern/diff/D65549001)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139358
2024-11-07 18:28:29 -08:00
f0ffaa5e16 Revert "[inductor] fix test_linear_binary_dynamic_shapes_cpp_wrapper (#139942)"
This reverts commit 0618c7fe667a4ca3891d0699bfd7cf2e4964924b.

Reverted https://github.com/pytorch/pytorch/pull/139942 on behalf of https://github.com/huydhn due to Sorry for revert this, but I think we miss running the test and it is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/139942#issuecomment-2463599298))
2024-11-08 01:55:48 +00:00
cyy
da1e120dfd [2/N] Replace c10::sv with std::sv (#139456)
Follows  #139453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-11-08 01:48:00 +00:00
81d077cca2 Fix to modules.rst: indent line with activation functions (#139667)
At line 205, I believe the code `x = self.activations[act](x)` should be indented so that it is in the body of the for loop. Otherwise, applying the four linear modules has the same effect as applying a single linear module, in the sense that it is still just a linear map so there is no point in having four of them.  In other words, each layer of this network should have a nonlinearity.
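
A minimal sketch of the corrected pattern (module and key names are illustrative, not the exact code from modules.rst): the activation lookup sits inside the loop body, so every linear layer is followed by a nonlinearity.

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linears = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
        self.activations = nn.ModuleDict({"relu": nn.ReLU(), "lrelu": nn.LeakyReLU()})

    def forward(self, x, act="relu"):
        for linear in self.linears:
            x = linear(x)
            x = self.activations[act](x)  # indented into the loop: applied after every layer
        return x

out = DynamicNet()(torch.randn(2, 8))
```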

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139667
Approved by: https://github.com/malfet
2024-11-08 01:12:52 +00:00
103cbd7231 [MPS] Restrict MSELoss to floating types (#139960)
Because if invoked with a long type it crashes deep in the MPSGraph framework, and to keep parity with CPU.

Add a test that validates that if the dtype is not a floating type, both CPU and MPS implementations will error out.
Fix the function name reported for `mse_loss_out_mps`, as `__func__` for any structured op implementation is `impl`.

Fixes https://github.com/pytorch/pytorch/issues/139723
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139960
Approved by: https://github.com/kimishpatel
ghstack dependencies: #139961, #139959
2024-11-08 00:28:54 +00:00
1127c82592 Revert #137523: Add functionality to call dump function of NCCL profiler plugin (#139847)
Reverts PR https://github.com/pytorch/pytorch/pull/137523

Reasons for the reversion:
1. The NCCL profiler plugin is meant to be opened by NCCL, and the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is in a better position to be either an opener or a provider. (The PR being reverted made `torch.distributed` an opener.)

2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below:
c19c384690/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L415-L427)
After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139847
Approved by: https://github.com/c-p-i-o
2024-11-08 00:24:29 +00:00
cyy
bf1b8adee6 Turn static inline into static function (#139843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139843
Approved by: https://github.com/ezyang
2024-11-07 23:58:18 +00:00
dbaa431dfb Put remote fx cache dynamo_timed definition in OSS location (#140016)
Summary: I'm refactoring dynamo_timed and updating the params. It will be much easier to do this refactor entirely in OSS. So this diff essentially provides a couple aliases in the OSS area that I can update without affecting the internal usage.

Test Plan: Ran locally and made sure I still got samples: https://fburl.com/scuba/dynamo_compile/sandbox/qub89lwj

Reviewed By: oulgen

Differential Revision: D65580302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140016
Approved by: https://github.com/oulgen
2024-11-07 23:51:48 +00:00
ae01f2b61b Extend CPU implementation of MSELoss to BF16 (#139959)
It's strange that it has not been implemented for the type yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139959
Approved by: https://github.com/jgong5, https://github.com/janeyx99
ghstack dependencies: #139961
2024-11-07 23:50:15 +00:00
22dd17c7bb [doc] fixing missing colon in custom op doc (#140060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140060
Approved by: https://github.com/malfet
2024-11-07 23:48:44 +00:00
c076001ed9 handle AttrProxy._modules when module is overwritten as None (#139957)
Fixes tracing through `mod._modules` access, when one of the submodules has been reset to None

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139957
Approved by: https://github.com/zhxchen17
2024-11-07 23:39:48 +00:00
2ee91db03d Add APIs to separate norm calculation and gradient scaling in nn.utils.clip_grad_norm_ (#139662)
Fixes https://github.com/pytorch/pytorch/issues/139467

Refactor `nn.utils.clip_grad_norm_` into `nn.utils.get_total_norm` and `nn.utils.clip_grads_with_norm_`. `clip_grad_norm_` now calls into these two new ops.

`get_total_norm` is generalized (rather than `get_grad_norm`), per the discussion on the issue from @awgu.
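
A short sketch of using the two new utilities separately, intended to be equivalent in spirit to `nn.utils.clip_grad_norm_(model.parameters(), 1.0)` (exact signatures may differ slightly from this illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 4)
model(torch.randn(16, 8)).sum().backward()

grads = [p.grad for p in model.parameters() if p.grad is not None]
total_norm = nn.utils.get_total_norm(grads, norm_type=2.0)
nn.utils.clip_grads_with_norm_(model.parameters(), max_norm=1.0, total_norm=total_norm)
```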

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139662
Approved by: https://github.com/H-Huang
2024-11-07 23:13:23 +00:00
09ba38c4b7 Add an opt-out label to runner determinator on PR (#140054)
My sales pitch: I need to ssh into the runner from time to time on my PR to debug issues, but it's well known that LF runners don't support SSH login anymore. So, the proposed fix here is to introduce a new label called ~no-runner-determinator~ `no-runner-experiments` that can be attached to the PR. Whenever `.github/scripts/runner_determinator.py` runs on a PR and sees this label, it will not apply any logic and will just use an empty prefix.

### Testing

With the label:

```
python3 runner_determinator.py \
    --github-token "MY_TOKEN" \
    --github-issue "5132" \
    --github-branch "install-torchao-torchtune-et" \
    --github-actor "huydhn" \
    --github-issue-owner "huydhn" \
    --github-ref-type "branch" \
    --github-repo "pytorch/pytorch" \
    --eligible-experiments "" \
    --pr-number "139947"

INFO    : Opt-out runner determinator because #139947 has no-runner-determinator label
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::
```

Without the label:

```
python3 runner_determinator.py \
    --github-token "MY_TOKEN" \
    --github-issue "5132" \
    --github-branch "install-torchao-torchtune-et" \
    --github-actor "huydhn" \
    --github-issue-owner "huydhn" \
    --github-ref-type "branch" \
    --github-repo "pytorch/pytorch" \
    --eligible-experiments "" \
    --pr-number "139947"

INFO    : Based on rollout percentage of 95%, enabling experiment lf.
INFO    : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```

Running in trunk commit without a PR number will use the regular logic:

```
python3 runner_determinator.py \
    --github-token "MY_TOKEN" \
    --github-issue "5132" \
    --github-branch "install-torchao-torchtune-et" \
    --github-actor "huydhn" \
    --github-issue-owner "huydhn" \
    --github-ref-type "branch" \
    --github-repo "pytorch/pytorch" \
    --eligible-experiments "" \
    --pr-number ""

INFO    : Based on rollout percentage of 95%, enabling experiment lf.
INFO    : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140054
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-11-07 22:55:27 +00:00
ba499c32cb [export] Disable AttrProxy when every submodule has a unique path. (#139918)
Summary:
In most cases, we don't need to turn on AttrProxy tracing for two reasons:
1. It's only needed when you have one submodule owning multiple FQNs.
2. AND it will cause a model using module identity to be traced incorrectly (because we substitute module objects at tracing time).

Overall, after offline discussion with some export folks, we think it's better to turn off AttrProxy if we can make sure every submodule has a unique FQN, which tends to be the common case.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r module_dict_key

Differential Revision: D65555919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139918
Approved by: https://github.com/tugsbayasgalan
2024-11-07 22:43:14 +00:00
75f3056c81 [hop-db] Import invoke_subgraph to avoid Dynamo error on mac (#140038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140038
Approved by: https://github.com/ydwu4
2024-11-07 22:36:57 +00:00
0618c7fe66 [inductor] fix test_linear_binary_dynamic_shapes_cpp_wrapper (#139942)
I recently added a new pattern here https://github.com/pytorch/pytorch/pull/139136 to remove pointless view/permute pairs. In that PR, I already updated the matched pattern/node count in `test_linear_binary` to account for the new pattern. But it looks like with cpp wrapper, one more pattern is matched.

```
7 patterns without cpp-wrapper:

========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", l
ine 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f6d25c67b50, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.p
y", line 581> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", l
ine 568> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", l
ine 568> =======
========== pattern matched <code object linear at 0x7f6d176e5dc0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 11
21> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f6d176e5210, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mk
ldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f6d176d3ec0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =
======

8 patterns with cpp wrapper:
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", l
ine 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f8e78bf0870, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.p
y", line 581> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", l
ine 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", l
ine 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", l
ine 568> =======
========== pattern matched <code object linear at 0x7f8e59c04190, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 11
21> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f8e59dfb520, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mk
ldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f8e59dfa290, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =
======
```

I fixed this test by adding 1 to the expected number when cpp wrapper is enabled. But fundamentally, can we avoid asserting on the total number of patterns matched in the test? That makes the test very fragile: people adding new patterns may keep breaking these 'unrelated' tests. One possible improvement is to keep a counter for each specific pattern and, in the tests, check the match count for the ***RELEVANT*** patterns instead of the total number of patterns matched. That should reduce false positives for broken tests. cc possible test creator @jgong5

Fixes https://github.com/pytorch/pytorch/issues/139812 (we need to have this to run this disabled test on your PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139942
Approved by: https://github.com/huydhn, https://github.com/jgong5
2024-11-07 22:34:25 +00:00
68f1b52d8a Revert "Turn static inline into static function (#139843)"
This reverts commit 72d3f5b26d90396f7a357fa3e5d82656ca74c102.

Reverted https://github.com/pytorch/pytorch/pull/139843 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing tests to fail on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11729669425/job/32675829894) [HUD commit link](72d3f5b26d) ([comment](https://github.com/pytorch/pytorch/pull/139843#issuecomment-2463354131))
2024-11-07 22:29:45 +00:00
d1a45800a3 refresh numbers after accepted less than noise regression (#140029)
https://github.com/pytorch/pytorch/pull/138363 regressed some benchmarks, but by less than the noise level; updating the values to avoid flakiness.
<img width="803" alt="Screenshot 2024-11-07 at 10 31 29 AM" src="https://github.com/user-attachments/assets/31326452-a6ad-44b8-b324-25e953355fcf">

PASS: benchmark ('add_loop_eager', 'compile_time_instruction_count') pass, actual result 3073605220 +1.21% is within expected 3037000000 ±1.50%

PASS: benchmark ('add_loop_eager_dynamic', 'compile_time_instruction_count') pass, actual result 5700849667 +1.37% is within expected 5624000000 ±2.50%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140029
Approved by: https://github.com/bobrenjc93
2024-11-07 22:27:00 +00:00
83e36a6bfa AOTI Minifier (#139351)
See documentation at https://docs-preview.pytorch.org/pytorch/pytorch/139351/torch.compiler_aot_inductor_minifier.html.

Add a minifier for AOTI.

Test Plan:
python test/inductor/test_minifier.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139351
Approved by: https://github.com/desertfire
2024-11-07 21:43:44 +00:00
8d070d23d6 [ROCm] Tune flex-attention and decode to num_stages=1 (#139883)
Fixes #139755 #139621

The new stream pipeliner in the AMD Triton backend enables num_stages to function equivalently to the NV backend. This upgrade in Triton 3.2 causes OOM issues in flex attention due to the num_stages=3 setting; we have tuned this to num_stages=1, which is the best setting for flash attention kernels and avoids the shmem issues.

We will follow up this PR with some config tuning on AMD backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139883
Approved by: https://github.com/bertmaher
2024-11-07 21:16:52 +00:00
36e0f119d0 Revert "[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)"
This reverts commit 5203138483e97141ad96a8906f1c6f8b7ff8adc6.

Reverted https://github.com/pytorch/pytorch/pull/139227 on behalf of https://github.com/yifuwang due to Need to address internal build failure D65605027 ([comment](https://github.com/pytorch/pytorch/pull/139227#issuecomment-2463204467))
2024-11-07 21:01:36 +00:00
d378819068 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-07 20:54:39 +00:00
b5286ba207 Small fix to Python rendering in documentation. (#138281)
The text was being rendered as normal text, but I believe it was meant to be code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138281
Approved by: https://github.com/janeyx99
2024-11-07 20:48:47 +00:00
d8afa21ef2 specialize symfloats for wrapped_gradient in get_fake_value (#139935)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_gradient_type_promotion_cpu` when `specialize_float=False`

Reviewers might wonder why we need to have this whitelist. Can't we rely on python_arg_parser.h to do the specialization generically? Alas this path doesn't actually FFI to C++ so we do need to do the specialization in pythonland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139935
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896
2024-11-07 20:27:02 +00:00
bdeca2a24f [BE] Remove warn about using Half on CPUs (#139961)
The warning was added by https://github.com/pytorch/pytorch/pull/33021, but modern CPUs are quite capable of handling half-precision types.
Alternatively, one could guard the warning with `#ifdef x86_64`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139961
Approved by: https://github.com/jgong5
2024-11-07 20:23:42 +00:00
df136df8d5 Remove upload_test_stat_aggregates script (#139915)
Instead of moving these queries to ClickHouse, we're just going to remove the script since it's not really used. We do want something for test aggregates, but we can write a new script instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139915
Approved by: https://github.com/huydhn
2024-11-07 20:14:12 +00:00
cyy
83fa1014f1 [3/N] Replace c10::sv with std::sv (#139861)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139861
Approved by: https://github.com/ezyang
2024-11-07 20:03:57 +00:00
85204d0081 Don't wrap inf values as symfloat (#139896)
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139896
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454
2024-11-07 20:03:54 +00:00
cyy
9d09af981b Wrap torch_python with torch_compile_options (#136743)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136743
Approved by: https://github.com/ezyang
2024-11-07 19:36:40 +00:00
d0da40a8b9 [PT2][Optimus] fix the default alpha and beta values (#139857)
Summary:
We noticed that the default coefficient values for beta and alpha should be int 1 instead of float 1.0; the float defaults cause an error when the inputs to the add are int types.
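
A small repro of the kind of failure described, assuming plain `torch.add` semantics (illustrative only):

```python
import torch

a = torch.ones(3, dtype=torch.int64)
b = torch.ones(3, dtype=torch.int64)

torch.add(a, b, alpha=1)      # fine: integer alpha with integral inputs
# torch.add(a, b, alpha=1.0)  # raises: a floating-point alpha is rejected for integral tensors
```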

More context:

https://fb.workplace.com/groups/1075192433118967/permalink/1539142760057263/

Test Plan:
# local reproduce
```
buck2 run mode/opt scripts/shuaiyang:test -- --optimus --flow_id 660724017 2>&1 | tee ~/local_run_shuai_660724017.txt
```

trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-11-05-21-18-17/trace.json.gz&bucket=gpu_traces

# E2E

before fix:
f660724017

after fix:

Differential Revision: D65521638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139857
Approved by: https://github.com/jackiexu1992
2024-11-07 19:12:23 +00:00
cyy
72d3f5b26d Turn static inline into static function (#139843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139843
Approved by: https://github.com/ezyang
2024-11-07 19:08:41 +00:00
f5147e989c [dynamo] prefix some eval_frame.c functions with dynamo_ (#139921)
Fix https://github.com/pytorch/pytorch/issues/137994. I didn't prefix every function, but the ones that are on the hotpath.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139921
Approved by: https://github.com/ezyang
2024-11-07 19:07:23 +00:00
071d48c56e Add output_node util function to fx.Graph (#139770)
Summary: A util function for accessing the output node of an FX graph

Test Plan: OSS CI

Differential Revision: D65486457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139770
Approved by: https://github.com/ezyang, https://github.com/Chillee
2024-11-07 18:54:59 +00:00
ee54dfb64d [Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643)
Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last.

This way, the channels-last CK instances will be added to benchmark choices in max autotune

# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
2024-11-07 18:37:39 +00:00
edbf57b336 [pipelining] remove extra variables (#139817)
Cleaning up counters / extra variables not needed after https://github.com/pytorch/pytorch/pull/139415 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139817
Approved by: https://github.com/wconstab
2024-11-07 18:32:20 +00:00
8f4b29810b Fix aarch64 wheel builds (#140020)
The shell script is still referencing the builder checkout rather than PyTorch, which results in
```
python /builder/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
python: can't open file '/builder/aarch64_linux/aarch64_wheel_ci_build.py': [Errno 2] No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140020
Approved by: https://github.com/atalman
2024-11-07 18:24:34 +00:00
eabef5000f [user triton] reset kernel_side_table before test_tma_capture_and_functionalize (#139907)
The test was failing when I ran the whole test suite. I'm guessing that the exact indices would previously depend on the order that tests would run; by resetting the kernel_side_table we should hopefully get results that are reproducible independent of the test execution order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139907
Approved by: https://github.com/oulgen, https://github.com/aakhundov
2024-11-07 17:56:53 +00:00
cyy
ca7fdfe4d2 [Reland] Use static_assert to detect get_type_index used in device code (#139966)
#139173 was reverted due to an internal build break caused by using get_type_index in device code. This PR is created for ease of importing into Meta for further investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139966
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-07 17:36:47 +00:00
e474f0de82 [PGNCCL] Slimming watchdog loop (#139834)
- Refactored traceback code into `work.printTraceback()`.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @shuqiangzhang
- Refactored desync debug code into `class DesyncDebugger`.
- Moved occurrences of `futureWorkResult_->markCompleted` into `checkAndSetException` and `checkTimeout`, respectively. cc @shuqiangzhang
- Modularized dump signal broadcast code into `ProcessGroupNCCL::broadcastDumpSignal`. cc @fduwjj @c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139834
Approved by: https://github.com/shuqiangzhang
2024-11-07 17:22:44 +00:00
a60bc051e3 Revert "Fix the use of fsspec transactions (#135541)"
This reverts commit 59cf4bc5ae64aea2c6a9b870243821695adfc30b.

Reverted https://github.com/pytorch/pytorch/pull/135541 on behalf of https://github.com/ZainRizvi due to Breaking internally. See D65551490 ([comment](https://github.com/pytorch/pytorch/pull/135541#issuecomment-2462774239))
2024-11-07 17:03:37 +00:00
7e02386303 Revert "[2/N] Replace c10::sv with std::sv (#139456)"
This reverts commit 028c5d3426743673edbbe6e11a491d76f1402f7c.

Reverted https://github.com/pytorch/pytorch/pull/139456 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. @ezyang can you please help get this landed? See D65546398 for more details ([comment](https://github.com/pytorch/pytorch/pull/139456#issuecomment-2462768891))
2024-11-07 17:00:59 +00:00
781c68c865 [aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095)
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731

Introducing the ability for a subclass to implement type conversion to an expected_type.
```
    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
```
Here, `expected_type=None` means the subclass's own class is expected.

E.g. for `DTensor` we may find a tangent that is an `AsyncCollectiveTensor` where we expected a `Tensor`; in this case
the conversion will be called at runtime with `expected_type=Tensor`.

Adding an implementation to AsyncCollectiveTensor that just triggers `wait()`; a sketch follows below.
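
A minimal sketch (not the exact in-tree code) of what that implementation could look like on `AsyncCollectiveTensor`: when a plain `torch.Tensor` is expected, wait on the pending collective and return the materialized tensor.

```python
import torch

# Sketch of a method body on AsyncCollectiveTensor (shown standalone here):
def __coerce_same_metadata_as_tangent__(self, expected_metadata, expected_type=None):
    if expected_type is torch.Tensor:
        return self.wait()  # block until the pending collective completes
    return None
```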

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
2024-11-07 16:24:48 +00:00
8d3d47e439 Trigger symfloat specialization in argument binding code (#139454)
Fixes the test `python test/inductor/test_torchinductor.py CpuTests.test_upsample_cat_conv_cpu` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139454
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846
2024-11-07 16:10:23 +00:00
c35a01173b Remove compile event logging for automatic dynamic (#139891)
Summary: These events are a pretty large portion of the table, but not really currently used. Only log to tlparse for now.

Test Plan: Unit tests

Differential Revision: D65539986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139891
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-07 14:52:10 +00:00
81ecf98d23 Pass all arguments when quantizing embedding bag from float (#137697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137697
Approved by: https://github.com/snadampal, https://github.com/jerryzh168
2024-11-07 09:53:49 +00:00
314aa268ce In AMX GEMM micro-kernel, use same dtype for A & B only if B is dequantized (#139906)
@frost-intel discovered that some Inductor auto-tuning UTs for CPU are currently broken on machines supporting AMX ISA. That's because in #136688, I had reverted a change in the AMX GEMM micro-kernel that was introduced in #131887, but it looks like some other implementations introduced after the aforementioned change rely upon it, so it should not have been reverted.

Added a fix.

Ideally, a CI machine that supports AMX should cover these UTs (test/inductor/test_cpu_select_algorithm.py). We do have at least one CI machine that supports AMX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139906
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-11-07 09:18:59 +00:00
a4e7b8001c refuse to generate a symbolic variable if a float input is inf (#139846)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCPU.test_cauchy_cpu_float64` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139846
Approved by: https://github.com/ruidazeng, https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572
2024-11-07 09:16:55 +00:00
c4a323ed05 [Inductor] Generalize device-bias code newly introduced in scheduler.py (#139872)
[Inductor] Generalize device-bias code newly introduced in scheduler.py to align the Inductor behavior for xpu with cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139872
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
ghstack dependencies: #139705
2024-11-07 07:10:28 +00:00
320374b011 [Inductor] Refine triton_bundler.py to support correctly on Intel GPU and fix CI failures. (#139705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139705
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
2024-11-07 07:10:28 +00:00
3caf56d97a Mark full_like as core ATen (#139937)
Fixes #139617

As titled. For ExecuTorch `full_like` is implemented so this should be fine: https://github.com/pytorch/executorch/blob/main/kernels/portable/cpu/op_full.cpp

Also there are decompositions for ops such as `fill.Scalar` that gives `full_like`: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L164

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139937
Approved by: https://github.com/tugsbayasgalan
2024-11-07 07:08:18 +00:00
c03324de2d Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-11-07 06:28:47 +00:00
ca30704f0b [Inductor][ROCm][CK] Add standalone runner (#139441)
Generate standalone executable to debug and profile CK gemm instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139441
Approved by: https://github.com/ColinPeppler
2024-11-07 06:21:27 +00:00
d36fdaf157 Openreg: Support stream (#136991)
Support streams. When the driver communicates with the executor, it sends the stream id corresponding to the execution command; when the executor receives the command with the stream id, it ignores the stream id because the CPU backend doesn't support asynchronous execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136991
Approved by: https://github.com/ezyang
2024-11-07 06:09:07 +00:00
ab42967238 [hop free symbols] lift free symbols in example_value when create_graph_input (#138363)
There are 4 parts in this PR (they are hard to break into smaller ones because they're highly coupled):
1. **Whenever we call create_graph_input, we try to bind the symbols in the graph input.**
We've enforced the invariant that all create_graph_input calls must provide an example value, so we can intercept at the create_graph_input calls (this PR only handles free symbols in tensors).
2. **We cache the bound_symbols** to avoid lifting the same symbol repeatedly.
3. For lifted symbols, we reuse **lifted_freevars**, i.e. the mapping from a symbol proxy in the parent graph to the lifted placeholders in the current subgraph, with which we also handle lifted tensors. In this way, all hops that support lifted tensors should be able to handle lifted symints automatically (at least in the dynamo part).
4. For **unbacked symbols** created during tracing, we also need to bind these symbols to their proxies. This is to support the test cases where we want to lift unbacked symbols as inputs. We need the proxy of the unbacked symbol in the parent graph in order to properly create the args to the hop.
5. We update all the tests now that free symbols are lifted in subgraphs, and also support the lifted symbols in existing higher order ops.

**The interaction of nested tracers:**
The previous design for lifting tensor closures is: suppose we're in nested tracers; whenever we see a new proxy that's not created by the current tracer, we recursively look for the proxy in parent tracers until we find the tracer that created this proxy (either a placeholder or some intermediate result). More detail is in Note [Nested SubgraphTracer and free_variable handling].

Given the above design, the plan for lifting the free symbols is: whenever we lift a free tensor to be an input of the current subgraph, we look at the symbols in it and bind them at the same time.

For example, suppose we have the following function:
```python
def f(x: [s1, s2]):
  def true_f():
    def true_f_inner():
      return x.sin()
```
what will happen in time order:

1. we create a subtracer 1 and start to speculate the outer cond's true_f
2. we create a another subtracer 2 and start to speculate the inner cond's true_f_inner.
3. dynamo realizes the tensor input x by calling wrap_tensor at top level to create graph input x (tracer 0); we bind the symbols s1, s2 after the ph for x is created. So the graph now looks like:
```python
def gm(s1, s2, x):
```
4. when seeing TensorVariable.call_method of x, tracer 2 wants to create a call_function(sin, proxy_of_x), but it finds that proxy_of_x is not created by the current tracer. So it recursively looks up its parent tracer 1, finds that tracer 1 also doesn't track this proxy_of_x, and then finds the root tracer 0, which is its creator and tracks it as a ph. Then tracer 1 calls create_graph_input to lift the closure to its input ph1 and adds the (proxy_of_x: ph1) k-v pair in the **lifted_freevars** of tracer 1.
Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(x):
```
5. Since there are free symbols inside this new tensor input, tracer 1 also binds the symbols (maybe_bind_symbol), which calls create_graph_input for s1 and s2. Now the graph looks like
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
```
6. then it goes back to tracer 2, calls create_graph_input for x and gets ph2; tracer 2's **lifted_freevars** records (ph1, ph2), and tracer 2 also binds the symbols in this new tensor input. Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1, s2, x):
```
7. Finally the sin call_function node is created by tracer 2.

**This PR also handles the following cases:**
- What if we lift two tensors that share the same symbol? e.g. x1 [s1, s2], x2 [s2, s3]? Each subtracer maintains bound_symbols as a cache that maps a symbol.expr to its proxy in the current tracer. So when we see x1, we track s1 and s2 as inputs and bind s1 to ph1 and s2 to ph2. Then when we try to bind the symbols of x2, s2 is already tracked so no graph input is created.
- what if a subgraph closes over a symint? e.g.
```python
def f(x):
  def true_f():
    c = x.size(0)
   def true_fn_inner():
     return c
```
When we speculate true_fn_inner, we find proxy_of_c is not tracked by tracer 2, so it recursively looks up its parent. At this point, x and its symbols have been lifted as inputs of true_f (as a result of lifting x while tracing true_f in tracer 1). Specifically, the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner():
```
So tracer 2 is able to find that s1 has been tracked as a ph in tracer 1, so it returns and calls create_graph_input on s1. The graph now looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1):
     return s1
```

- What if a subgraph closes over an unbacked symint? e.g.
```python
def f(x):
  def true_f():
    c =  x.item()
    def true_f_inner():
      return c
```
When x.item() is called, proxy_of_c and its symnode variable are created for tracer 1, and we also call track_unbacked_symbols to record this relationship. So when tracer 2 finds that proxy_of_c is not created by the current tracer, it recursively looks up its parent tracer and finds that the expression u0 has been tracked as a result of track_unbacked_symbol in tracer 1. So it stops the recursion and calls create_graph_input for u0 in tracer 2. The graph looks like:
```python
def f(x):
  def true_f(s1, s2, x):
    c = x.item()
    def true_gm_inner(u0):
      return u0
    cond(pred, true_gm_inner, false_gm_inner, (c,))
```

- What if a subgraph closes over a tensor with an unbacked symint shape?
```python
def f(x):
  def true_f():
    c = x.item()
    r = torch.randn((c,))
    def true_f_inner():
      return r + 1
```
This is the same as the case of closing over tensors with backed shapes: we first lift r, then bind u0 in it, which recursively calls bind_symint on u0 in its parent and finds that u0 is already tracked in the parent tracer as a result of the .item() call.
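
Sketched in the same style as the earlier cases (argument order and names are illustrative, not taken from the actual generated graph):
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    c = x.item()            # produces unbacked symint u0
    r = torch.randn((c,))   # closed-over tensor with shape [u0]
    def true_gm_inner(r, u0):
      return r + 1
    cond(pred, true_gm_inner, false_gm_inner, (r, c))
```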

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138363
Approved by: https://github.com/zou3519
2024-11-07 04:44:32 +00:00
3368f3ad41 [ONNX] Update TorchTensor implementation to handle fake mode (#139534)
Update the TorchTensor implementation to handle fake mode better. Specifically, when getting the weights, if a weight is already a real tensor we disable fake mode before calling detach() etc., so that we do not lose it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139534
Approved by: https://github.com/fatcat-z, https://github.com/titaiwangms
2024-11-07 04:36:24 +00:00
2037ea3e15 Add type annotations to Configs (#139833)
Summary:
Adds types to Configs, and fixes a bug in options that was caused by the lack of types.

fixes: https://github.com/pytorch/pytorch/issues/139822

Configs are used by many modules so not sure which label to put.

Types also allow https://github.com/pytorch/pytorch/pull/139736 to fuzz configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139833
Approved by: https://github.com/c00w
2024-11-07 03:49:09 +00:00
5203138483 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that an input chunk is not consumed before its corresponding signal arrives.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```
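
A rough Python-level sketch based only on the signature above; the signal dtype, the readiness convention, and any required symmetric-memory setup are assumptions rather than verified usage:
```python
# Hypothetical sketch: all chunks are marked ready up front, so this behaves
# like a plain mm. Signal dtype/values are assumptions, not verified usage.
import torch

a = torch.randn(4096, 3584, device="cuda", dtype=torch.bfloat16)
b = torch.randn(3584, 8192, device="cuda", dtype=torch.bfloat16)
num_chunks = 8
a_chunk_signals = torch.ones(num_chunks, device="cuda", dtype=torch.uint32)  # assumed encoding: 1 == ready
out = torch.ops.symm_mem._async_input_mm(a, b, a_chunk_signals, 0)
```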

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-07 03:43:12 +00:00
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
604e353cae Revert "Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)"
This reverts commit 060bee7f22a6ff5c14562713dc4bb6aa74923469.

Reverted https://github.com/pytorch/pytorch/pull/139787 on behalf of https://github.com/huydhn due to Sorry for reverting this, but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/139787#issuecomment-2461234683))
2024-11-07 03:17:16 +00:00
f459c3095f [dynamo] Document codegen and clean up some code paths (#139670)
This patch
1. Adds documentation to `PyCodegen.__call__`, `PyCodegen.tempvars` and
   the `allow_cache` flag.
2. Merges a few existing code paths in `PyCodegen.__call__`.
3. removes the `elif var in cg.tempvars` code path in
   `codegen_save_tempvars`, because it's no longer needed after #113725,
   as we have up-to-date `VariableTracker.source` now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139670
Approved by: https://github.com/jansel
ghstack dependencies: #139538
2024-11-07 03:14:16 +00:00
183b386cb2 [dynamo] Simplify Codegen for variables with MutableSideEffects (#139538)
This effectively undoes #115095, which is not longer be needed after #113725.

Why did we need #115095? I went back in history and found that [this line](https://github.com/pytorch/pytorch/pull/113725/files#diff-0bb1756725c4426408938314b0c9d3988ae5bf49994892d7038ad7746e209e9fR86)
actually fixed what #115095 fixed. Specifically, without the
`allow_cache` check for the "dup_top" optimization, we could incorrectly
codegen based on source, despite `codegen_update_mutated` requested to
codegen from value, for updates to pre-existing lists, etc. Since #113725 added
the `allow_cache` check, we no longer need the `mutable_side_effects_from_source`
code path from #115095.

However, #115442 introduced a `value_from_source` flag which didn't
account for the `mutable_side_effects_from_source` branch. So this patch
adds an extra check to keep existing behavior for export, and leaves a
TODO for investigating what exactly export wants from codegen, when it
comes to side effects and sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139538
Approved by: https://github.com/jansel
2024-11-07 03:14:16 +00:00
cf0bb6c435 [cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)
Reopen https://github.com/pytorch/pytorch/pull/121782, as more optimizations have landed.

Fixes https://github.com/pytorch/pytorch/issues/115261, https://github.com/pytorch/pytorch/issues/113017.
For CPU inductor path, remove -ftree-loop-vectorize from optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32
![image](https://github.com/user-attachments/assets/ec920928-fa36-467f-ba07-d2c05c51b92e)

Outlier models (speedup<0.8, single socket): None.

#### BF16
![image](https://github.com/user-attachments/assets/4a301e5e-147d-4b74-beb1-40290969ed80)

Outlier models (speedup<0.8, single socket multi threads):

- functorch_dp_cifar10 0.58
- opacus_cifar10 0.57

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136827
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-11-07 02:49:52 +00:00
617b4538f1 Support symbolic builtin round in export (#139549)
Differential Revision: D65380866
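
A hedged illustration of the kind of user code this enables (the module and shapes here are made up for illustration):
```python
# Illustrative only: the Python builtin round() on a symbolic value inside torch.export.
import torch

class M(torch.nn.Module):
    def forward(self, x):
        n = round(x.shape[0] / 2)  # round() applied to a symbolic expression
        return x[:n]

ep = torch.export.export(
    M(),
    (torch.randn(8, 4),),
    dynamic_shapes={"x": {0: torch.export.Dim("batch", min=2)}},
)
```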

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139549
Approved by: https://github.com/digantdesai, https://github.com/angelayi
2024-11-07 02:49:44 +00:00
FEI
54e680151b Optimize peak memory for flash _scaled_dot_product_attention_math (#139612) (#139613)
Fixes #139612

@drisspg @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139613
Approved by: https://github.com/drisspg
2024-11-07 02:25:39 +00:00
2b400236c2 [DCP] Cross-link DCP doc to tutorials (#139776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139776
Approved by: https://github.com/mhorowitz, https://github.com/LucasLLC, https://github.com/fduwjj
ghstack dependencies: #139938
2024-11-07 02:19:49 +00:00
b51b7e28ee Add DCP doc to DCP merge-rules (#139938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139938
Approved by: https://github.com/LucasLLC, https://github.com/c-p-i-o, https://github.com/fduwjj
2024-11-07 02:19:49 +00:00
4e647871d6 Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786
Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305
ghstack dependencies: #139716
2024-11-07 01:58:05 +00:00
47446cb5f3 [fr][c10d] move logger out from utils.py (#139806)
Summary:
Move flight recorder logger class out from utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.

Test Plan:
Build fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B  Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder

./fr_trace.par  -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

Differential Revision: D65503768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
2024-11-07 01:44:12 +00:00
d0ffd6d142 [AOTI] Add data_ptr to RAIIAtenTensorHandle (#139895)
Summary: To increase the readability of the generated code. This is not BC-breaking, because RAIIAtenTensorHandle is implemented as header-only.

Differential Revision: [D65547216](https://our.internmc.facebook.com/intern/diff/D65547216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139895
Approved by: https://github.com/chenyang78
2024-11-07 01:36:28 +00:00
4ddf015e7d [ONNX export] exporting model to onnx error when tensor.index_fill ops met dim=0 #139594 (#139596)
When the index_fill op's param dim == 0, there is no need to unsqueeze the index tensor's dimension. So we return the index tensor directly if the size of axes_i == 0.

Fixes #139594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139596
Approved by: https://github.com/justinchuby
2024-11-07 01:32:34 +00:00
bd5a2c2c71 [AOTI] Simplify the return code (#139889)
Summary:
```
    if constexpr (std::is_same_v<std::decay_t<decltype(buf3)>,RAIIAtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,AtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,ConstantHandle>) {
        output_handles[0] = buf3.release();
    } else {
        thread_local ThreadLocalCachedOutputTensor<std::decay_t<decltype(buf3)>> cached_output_0(buf3);
        cached_output_0.copy_data_from(buf3);
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&output_handles[0]));
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_assign_tensors(cached_output_0.tensor(), output_handles[0]));
    }
```
->
```
 output_handles[0] = buf3.release();
```

Test Plan: CI

Differential Revision: D65460719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139889
Approved by: https://github.com/chenyang78
2024-11-07 01:28:43 +00:00
6fcef86cfa [inductor] fix the unligned variable ranges issue in fuse node (#138568)
Fixes #138550.

### Description
In the fusion of two nodes, the node with fewer variables (`node_to_recomp`) aligns its variable ranges with the other node (`ref_node`). In detail, `node_to_recomp` changes its variable ranges to the original ranges of `ref_node`. However, if both nodes have changed their ranges, i.e., the simplified variable ranges differ from the original ones, the issue comes up.

### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-11-07 01:17:58 +00:00
ed0e63e938 Add NHWC support for group normalization (#126635)
Fixes #111824

Currently, if the user specifies their group normalization input to be of NHWC format, PyTorch will default to NCHW tensors and convert. This conversion is not immediately obvious to the user unless they check the format themselves, which is not intuitive. This PR adds support for NHWC on CUDA by adding the necessary kernels.
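
A small usage sketch of the channels-last path (the final check reflects the intended behavior, assuming the new kernels preserve the NHWC layout):
```python
# Hedged sketch: GroupNorm on a channels_last (NHWC) CUDA tensor.
import torch

x = torch.randn(8, 32, 16, 16, device="cuda").to(memory_format=torch.channels_last)
gn = torch.nn.GroupNorm(num_groups=8, num_channels=32).cuda()
y = gn(x)
# With the NHWC kernels, the output is expected to stay channels_last:
print(y.is_contiguous(memory_format=torch.channels_last))
```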

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126635
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-11-07 01:12:08 +00:00
59ec011855 [numerical debugger] bumped up the starting handler id (#139666)
Differential Revision: D65445250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139666
Approved by: https://github.com/tarun292, https://github.com/dulinriley
2024-11-07 01:00:43 +00:00
e675c6702d justknobs: Remove JustKnobsConfig and justknobs_feature (#138767)
This never ended up getting used, and instead we're doing this
resolution within the configuration system.

Removing these unused internal features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138767
Approved by: https://github.com/ezyang
ghstack dependencies: #138766, #138956
2024-11-07 00:21:46 +00:00
52446d7f30 Revert D65290089 (#139893)
Summary:
This diff reverts D65290089
This change is introducing more logging than I realized and could present problems for tlparsen

Test Plan: NA

Reviewed By: jamesjwu

Differential Revision: D65541060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139893
Approved by: https://github.com/jamesjwu
2024-11-07 00:10:09 +00:00
ac5fa26e07 [dynamo][weakref] Support weakref.ref call (#139914)
Should fix - https://github.com/pytorch/pytorch/pull/135001
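A hedged sketch of one pattern this could cover (whether the PR targets constructing the weakref, dereferencing it, or both is an assumption here):
```python
# Illustrative only: creating and calling a weakref.ref inside compiled code.
import weakref
import torch

class Holder:
    pass

holder = Holder()

@torch.compile(fullgraph=True)
def f(x):
    r = weakref.ref(holder)      # weakref.ref call inside the compiled region
    return x + 1 if r() is holder else x - 1

print(f(torch.ones(3)))
```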

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139914
Approved by: https://github.com/jansel
ghstack dependencies: #139856
2024-11-06 23:16:41 +00:00
738bfff5f9 [dynamo][user-defined] Fix bugs with method descriptors (#139856)
Should fix some problems in https://github.com/pytorch/pytorch/pull/138080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139856
Approved by: https://github.com/jansel
2024-11-06 23:16:40 +00:00
ed16f28f02 Fix ExecuTorch CI after landing #6564 (#139700)
After landing https://github.com/pytorch/executorch/pull/6564, we need to update the pinned ExecuTorch commit on PyTorch to fix the regression on the PyTorch side. The change to `.ci/docker/common/install_executorch.sh` is needed because it's how the dependencies are set up on ExecuTorch CI now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139700
Approved by: https://github.com/larryliu0820, https://github.com/malfet
2024-11-06 23:04:35 +00:00
060bee7f22 Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)
Previously we were checking for a last dim with stride == 1. When the size is <= 1 that also is sufficient because the stride is insignificant. Fix for https://github.com/pytorch/pytorch/issues/138317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139787
Approved by: https://github.com/drisspg
2024-11-06 22:53:01 +00:00
56a40d4ebb Add conda to Manylinux Docker images (#139903)
We would like to switch https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job.yml from ``pytorch/conda-builder`` to
``pytorch/manylinux-builder`` and later to ``pytorch/manylinux_2_28-builder``. Hence adding conda to these images.

Test Infra PR that does the switch : https://github.com/pytorch/test-infra/pull/5867 - need to be rebased after this PR is merged
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139903
Approved by: https://github.com/seemethere
2024-11-06 22:49:36 +00:00
b8cf324e50 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics. It will be much easier to do in OSS, and having the relevant timing code in the OSS area of the codebase will make that refactor simpler. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Temp Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-06 22:44:53 +00:00
8f077b811b [ROCm][Inductor]Fixing missing ck package warning when the backend is disabled (#139790)
```

test_addmm_multiple_dynamic_cuda (__main__.AOTInductorTestABICompatibleCuda) ... W1101 10:26:20.492000 1361741 torch/_inductor/utils.py:1207] Please pip install Composable Kernel package
AUTOTUNE addmm(16x6, 16x16, 16x6)
  triton_mm_0 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=0, num_stages=2, num_warps=1
  triton_mm_1 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=2, num_warps=1
SingleProcess AUTOTUNE benchmarking takes 0.2182 seconds and 0.2979 seconds precompiling for 2 choices
```
This PR disables the warning message when the CK backend is disabled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139790
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-11-06 22:04:32 +00:00
cbf449c83c [BE]: Add NT missing fp classification functions (#139890)
Follow-up to some issues that @malfet's recent PR (#139763) pointed out about missing ops. Tried to mirror it for other important nearby ops. It seems like we could automate / autogen this more for generic pointwise ops like these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139890
Approved by: https://github.com/malfet
2024-11-06 22:00:54 +00:00
aafb3deaf1 Remove multinomial from cudagraph skip list' (#139897)
Since https://github.com/pytorch/pytorch/pull/134818/files we can run multinomial in cudagraph without error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139897
Approved by: https://github.com/BoyuanFeng
2024-11-06 21:28:42 +00:00
86475dfc9f [ONNX] Prioritize strict=False export strategy (#139905)
Prioritize the `strict=False` export strategy in ONNX export because it is preferred according to @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139905
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-11-06 21:27:29 +00:00
779c0b80cd [inductor] collect memory snapshort in the wrapper (#138429)
To collect memory snapshot for a generated wrapper, run the wrapper with `--cuda-memory-snapshot`. E.g.
```
python /tmp/torchinductor_shunting/tmpyhtfwdlv/wp/cwpulanbieu4beruc6w5uc3podcs2x3rzdk5okftu37c4k3bnd4b.py --cuda-memory-snapshot
```
gives me:

<img width="800" alt="Screenshot 2024-11-05 at 3 53 47 PM" src="https://github.com/user-attachments/assets/82edd2d6-df57-488e-a390-8fa5fc00ba5f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138429
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #139136, #138756
2024-11-06 21:22:18 +00:00
2a857e940d config: Add env_name_default and env_name_force to Config (#138956)
This allows Configs to handle setting their defaults (or overriding
themselves) via environment variables.

The environment variables are resolved at install time (which is usually
import time). This is done 1) to avoid any race conditions between
threads etc., and 2) to encourage people to modify the
configs directly rather than overriding environment variables to change
PyTorch behaviour.
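
A hedged sketch of the idea (the `Config` class and names below are illustrative, not the actual torch config implementation):
```python
# Illustrative only: a config whose default is resolved from an environment
# variable once, at install/import time.
import os

class Config:
    def __init__(self, default, env_name_default=None):
        # env_name_default overrides the default when the env var is set
        if env_name_default is not None and env_name_default in os.environ:
            default = os.environ[env_name_default] == "1"
        self.value = default

use_fast_path = Config(default=False, env_name_default="EXAMPLE_USE_FAST_PATH")
```
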
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
2024-11-06 21:20:42 +00:00
1270c78268 Add logging for num_triton_bundles (#139807)
Summary: Adding logs for number of inductor cache triton bundles

Test Plan:
Ran adhoc code and looked at dynamo_compile/sandbox

https://fburl.com/scuba/dynamo_compile/sandbox/nhktfy19

Differential Revision: D65490826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139807
Approved by: https://github.com/masnesral
2024-11-06 21:11:04 +00:00
9018326bb8 Revert "[pt2 logging] move remote cache get/put logging up one level (#139423)"
This reverts commit c412a42ae2a978122d8a41b94c3861290bc689e0.

Reverted https://github.com/pytorch/pytorch/pull/139423 on behalf of https://github.com/ZainRizvi due to Reverted internally. See D65541060 for more details ([comment](https://github.com/pytorch/pytorch/pull/139423#issuecomment-2460765579))
2024-11-06 20:59:54 +00:00
ff616c26fb Optimize isclose description (#139724)
Fixes #139563

Make description user friendly.

After Change:

![image](https://github.com/user-attachments/assets/88a805c0-0105-4441-812b-582c09abc72b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139724
Approved by: https://github.com/janeyx99
2024-11-06 19:30:44 +00:00
dd6738c1ad Revert "Use Manylinux2_28 for wheel builds (#138732)"
This reverts commit 5860c8ebd155bd06666d87811847b73040b55f7b.

Reverted https://github.com/pytorch/pytorch/pull/138732 on behalf of https://github.com/atalman due to Reverting for now will be relanding ([comment](https://github.com/pytorch/pytorch/pull/138732#issuecomment-2460570980))
2024-11-06 19:12:52 +00:00
3abbde976d Allow any single non-batch dim to be ragged for NJT (#137125)
Fixes #137512

Relaxes the restriction that the ragged dim must be immediately next to the batch dim, e.g. `(B, *, D_0, ..., D_N)`. This allows constructing NJTs of shape e.g. `(B, D, j0)` directly. Before this PR it was possible to get an NJT of e.g. shape `(B, D, j0)` by constructing an NJT of shape `(B, j0, D)` and transposing it; this PR allows a user to go straight there without the transpose. The standard `torch.nested.nested_tensor(list)` constructor has been updated to support this.

At the very least, this is useful for testing on transposed NJTs. I'm willing to make this functionality private if needed.
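
A hedged usage sketch of constructing such an NJT directly (the exact ragged-dim inference and printed shape are assumptions):
```python
# Illustrative only: components share the non-batch dim (3) and are ragged in
# the last dim, giving an NJT of roughly shape (B=3, 3, j0).
import torch

components = [torch.randn(3, 5), torch.randn(3, 7), torch.randn(3, 2)]
nt = torch.nested.nested_tensor(components, layout=torch.jagged)
print(nt.shape)
```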
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137125
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-06 18:50:08 +00:00
d1e2e81ede [AOTI] Fix two test failures from #139471 (#139885)
Summary: https://github.com/pytorch/pytorch/pull/139471 caused two internal test failures due to different compiler path settings.

Differential Revision: D65519537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139885
Approved by: https://github.com/hl475
2024-11-06 18:41:28 +00:00
6ed237e5b5 [pytorch] Make global module hook to pass kwargs similar to how module hook works (#137403)
Differential Revision: D63576353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137403
Approved by: https://github.com/mikaylagawarecki
2024-11-06 18:20:57 +00:00
99deedff57 [ONNX] Describe memory usage of TorchDynamo-based exporter. (#139388)
Add new documentation to show one memory-usage benefit brought by the TorchDynamo-based ONNX exporter.

Also add a unit test to make sure TorchDynamo-based ONNX exporter works well under FakeTensorMode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139388
Approved by: https://github.com/xadupre
2024-11-06 17:29:11 +00:00
d6034016e2 Run slow jobs in trunk commits (#139842)
Per our discussion in https://fburl.com/gdoc/voce5o06, we will run slow jobs more frequently on all trunk commits.  Note that slowgradcheck jobs are moved to periodic as they are not about running slow tests.

There are currently 3 GPU + 2 ROCm + some CPU `linux.4xlarge` runners running slow jobs.  So, I don't expect to see a big increase in CI cost after this.

Also, these slow jobs will only run in trunk commits, not in PRs, so their duration won't affect PR TTS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139842
Approved by: https://github.com/clee2000
2024-11-06 17:21:39 +00:00
8d983aaf68 Add conda install to Manylinux 2_28 images (#139894)
This way we can use these images instead of conda-build images for all workflows in test-infra.

Please note:
- I am using the existing conda install script that's already used in https://github.com/pytorch/pytorch/blob/main/.ci/docker/conda/Dockerfile#L47
- A PR with an update to miniforge will be posted as a follow-up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139894
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-11-06 17:14:27 +00:00
6bdbc86550 [AOTI] Fix a cubin file path issue (#139848)
Summary: When we use aoti_compile_and_package to package the AOTI compiled artifacts, cubin files will be included, and at deploy time we should set up the cubin file directory to the right path that contains the unzipped cubin files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139848
Approved by: https://github.com/aakhundov
2024-11-06 16:45:30 +00:00
dd6a5de00d Allow OpOverloadPackets as safe torch functions, sanitize dynamo gm before running aotdispatch with cache (#139785)
Summary:
This diff implements two things to improve cache hit rates after testing AOTAutogradCache with internal cogwheel jobs:
- We should allow torch functions that are OpOverloadPackets
- When running with cache, there are some fields that dynamo puts into the input graph module to aotdispatch that are not stable between runs. We use a context manager to null these out so that they can't be used to affect the output of AOTAutograd, and then we put the fields back onto the gm before returning from AOTAutogradCache.load().

Test Plan:
New unit tests + running nanogpt with AOTAutogradCache.

Meta:

Run on a long running job
Cache miss:
 {F1953831996}

Cache hit:
 {F1953830872}

Servicelabs here:
https://www.internalfb.com/servicelab/experiment/4301352991/

Cache hit:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660597709-TrainingApplication/attempt_0/version_0/rank_0/index.html

Cache miss:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660569960-TrainingApplication/attempt_0/version_0/rank_0/index.html

We can see that with these changes, autograd cache hits and saves compile time:
https://fburl.com/scuba/pt2_compile_events/ycddxstd

Differential Revision: D65436373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139785
Approved by: https://github.com/bdhirsh
2024-11-06 16:34:02 +00:00
e05a096c49 Ignore polyfill when reporting user backtraces in summarized form (#139850)
Fixes https://github.com/pytorch/pytorch/issues/139316

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139850
Approved by: https://github.com/bobrenjc93
2024-11-06 16:33:34 +00:00
68ef445c33 [MPS][Perf] Dispatch to SDP-math-mps for non-contig Tensors (#139791)
MacOS 15 and newer support these out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.

Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_time)/1024.0**2:.2f} Mb")
image.save(f'bfloat16.png')
```

Before the change, total memory use was 16GB and the run needed 65 sec to complete; after it, memory drops to 14GB and the run takes 50 sec to finish on an M2 Pro, though the generated image remains the same:
![image](https://github.com/user-attachments/assets/1a35efef-9f80-4cd0-ac9c-30203eab6bb1)

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
2024-11-06 16:25:39 +00:00
59cf4bc5ae Fix the use of fsspec transactions (#135541)
fsspec transactions do not support concurrency and assume that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where, because of multi-threading, we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file.
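
A hedged sketch of the manual commit/discard approach (the filesystem and paths are illustrative; this is not the actual DCP code):
```python
# Illustrative only: write to a temp file, "commit" by renaming on success,
# "discard" by deleting on failure -- instead of relying on fs transactions.
import fsspec

fs, _ = fsspec.core.url_to_fs("file:///tmp")
tmp_path, final_path = "/tmp/ckpt.tmp", "/tmp/ckpt.bin"
try:
    with fs.open(tmp_path, "wb") as f:
        f.write(b"checkpoint bytes")
    fs.mv(tmp_path, final_path)  # commit
except Exception:
    if fs.exists(tmp_path):
        fs.rm(tmp_path)          # discard
    raise
```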

I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
2024-11-06 15:16:12 +00:00
641ca67d5a [ROCM] Fix hipBLASLt version check in TunableOp test (#139811)
Allow 3 or more digits for hipBLASLt version check in TunableOp test. Needed due to upcoming ROCm 6.3 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139811
Approved by: https://github.com/eqy, https://github.com/malfet
2024-11-06 14:37:45 +00:00
44df6522ee add Half/BFloat16 support for grid_sample on CPU (#134812)
Fix https://github.com/pytorch/pytorch/issues/127224.
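
A hedged usage sketch of the newly supported CPU dtypes:
```python
# Illustrative only: grid_sample with a BFloat16 input and grid on CPU.
import torch

inp = torch.randn(1, 1, 4, 4, dtype=torch.bfloat16)
grid = torch.rand(1, 2, 2, 2, dtype=torch.bfloat16) * 2 - 1  # values in [-1, 1]
out = torch.nn.functional.grid_sample(inp, grid, align_corners=False)
print(out.dtype)  # expected: torch.bfloat16
```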

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812
Approved by: https://github.com/Skylion007, https://github.com/mingfeima
2024-11-06 14:02:08 +00:00
cyy
d558c1a047 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 13:42:20 +00:00
c0c6bf4ef2 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353
2024-11-06 13:34:45 +00:00
44e4949bcf Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)"
This reverts commit 22e89ea2aaa3e0ef0ec4504bd2dbf230447a6d2a.

Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/malfet due to It broke number of tests, see 22e89ea2aa ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2459754355))
2024-11-06 13:31:26 +00:00
10d7729333 Revert "Enable cppcoreguidelines-special-member-functions (#139132)"
This reverts commit a9b4989c726a29b4b89c64282e32b9e4fc0b7d68.

Reverted https://github.com/pytorch/pytorch/pull/139132 on behalf of https://github.com/ZainRizvi due to Sorry but this fails on trunk. See inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_smooth_quant_with_int_mm [GH job link](https://github.com/pytorch/pytorch/actions/runs/11699366379/job/32591132460) [HUD commit link](22e89ea2aa) ([comment](https://github.com/pytorch/pytorch/pull/139132#issuecomment-2459743145))
2024-11-06 13:27:42 +00:00
06ad404401 Revert "[BE] And delete DeprecatedTypProperties cast (#139358)"
This reverts commit b82a51bc6b1170da3db8f67816799f3a47530ff8.

Reverted https://github.com/pytorch/pytorch/pull/139358 on behalf of https://github.com/malfet due to And it was backed out again due to the internal usages of deprecated API ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2459740090))
2024-11-06 13:23:43 +00:00
53299b8a38 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 0058f7100222523fa8b9f74af9ea7d341a6458b4.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/malfet due to And it was backed out again due to the internal usages of deprecated API ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2459740090))
2024-11-06 13:23:43 +00:00
5f266b5a02 [ROCm] re-enable flex attention UTs (#139632)
https://github.com/pytorch/pytorch/pull/136792 accidentally disabled flex attention UTs on ROCm. Re-enabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139632
Approved by: https://github.com/drisspg
2024-11-06 12:49:44 +00:00
d622b490d6 [Dynamo] Support tensor mro without source (#139838)
Fixes https://github.com/pytorch/pytorch/issues/137743

The issue here is that if `type` was called on a tensor without a source, we wouldn't have a source even for `torch.Tensor`, and the `__mro__` retrieval would fail. Since `torch.Tensor` is an internal torch type, I add handling for it in `call_type` in builtins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139838
Approved by: https://github.com/williamwen42
2024-11-06 08:52:53 +00:00
cyy
a9b4989c72 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 07:59:09 +00:00
22e89ea2aa [Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-11-06 07:54:47 +00:00
d031d1bf4c Update to upload-artifacts and download-artifacts to v4 (#139808)
The 2 actions actions/download-artifact@v3 and
actions/upload-artifact@v3 will be deprecated December 5th, 2024. This change updates them to using v4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139808
Approved by: https://github.com/seemethere
2024-11-06 05:57:41 +00:00
157c18a180 [BE][Attention] Use isneginf (#139763)
May be I'm missing some vital piece of information, but it feels like
```c++
  const auto neg_inf = at::scalar_tensor(-std::numeric_limits<float>::infinity(), at::TensorOptions().dtype(out.dtype()).device(out.device()));
  const auto masked = self.eq(neg_inf);
```
should be equivalent to [`torch.isneginf`](https://pytorch.org/docs/stable/generated/torch.isneginf.html) call
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139763
Approved by: https://github.com/Skylion007
ghstack dependencies: #139788, #139784
2024-11-06 04:32:37 +00:00
1c63612567 Fix & unit test for c10::ArrayRef constructed from user-defined types (#139758)
Fixes #139391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139758
Approved by: https://github.com/ezyang
2024-11-06 04:23:05 +00:00
d35a600b74 [pgnccl] skip restart test fro rocm (#139809)
Summary:
The PG restart test is flaky on ROCm (https://github.com/pytorch/pytorch/pull/139809), so skip the AMD/ROCm test for now
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139809
Approved by: https://github.com/kwen2501
2024-11-06 04:17:29 +00:00
96ca17fec4 [CD] Move linux-aarch64 build scripts (#139815)
All files in `.ci/aarch64_linux` folder are from 88590cd635/aarch64_linux
Companion PR to delete `aarch64_linux` folder in builder: https://github.com/pytorch/builder/pull/2030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139815
Approved by: https://github.com/wdvr, https://github.com/huydhn
2024-11-06 04:16:48 +00:00
c19c384690 Fix torch.load (torch.utils.benchmark) after #137602 (#139810)
After #137602, the default `weights_only` has been set to True.  This test is failing in trunk slow jobs atm

benchmark_utils/test_benchmark_utils.py::TestBenchmarkUtils::test_collect_callgrind [GH job link](https://github.com/pytorch/pytorch/actions/runs/11672436111/job/32502454946) [HUD commit link](1aa71be56c)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139810
Approved by: https://github.com/kit1980
2024-11-06 03:08:29 +00:00
63b01f328e [inductor] support masked_scatter w/ unbacked sized source (#138083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138083
Approved by: https://github.com/jansel
2024-11-06 02:16:25 +00:00
cyy
028c5d3426 [2/N] Replace c10::sv with std::sv (#139456)
Follows  #139453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456
Approved by: https://github.com/ezyang
2024-11-06 01:50:38 +00:00
39ede99a33 Add current FSDP2 path to old composable FSDP1 warning (#139759)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139759
Approved by: https://github.com/weifengpy, https://github.com/wz337
ghstack dependencies: #139650
2024-11-06 01:43:04 +00:00
bd45c00fde [BE][Attention] Code de-dup (#139784)
The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value we initialize boolean tensor to
Reduce duplication by introducing `convert_boolean_attn_mask_` that takes `neg_inf` value and make abovementioned implementations are trivial oneline call
Also, as suggested by @Skylion007, replace `at::where(foo->logical_not, -inf, 0)` with `at::where(*foo, 0, -inf)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139784
Approved by: https://github.com/Skylion007, https://github.com/drisspg
ghstack dependencies: #139788
2024-11-06 01:33:19 +00:00
aec179e2be Fix docs for logcumsumexp formula (#139768)
The previous formula was wrong and reused some indexing variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139768
Approved by: https://github.com/janeyx99
2024-11-06 01:19:09 +00:00
a787320d0f Do not try to optimize new implications in get_implications (#139738)
Summary:
Saves around 8% on the torchrec model.
In most cases the new implications are not optimizations anyway; in some cases they are,
but optimizing them is useless.

ex:
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding True
adding False
```

VS
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding 0 <= Mod(s0, 3)
adding Mod(s0, 3) < 0
```
The main difference is that 0 <= Mod(s0, 3) can be simplified to True and Mod(s0, 3) < 0 to False, but with this change
this won't happen. However, True: True and False: False entries are useless anyway, so this is fine.
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```

<img width="1082" alt="Screenshot 2024-11-04 at 9 25 51 PM" src="https://github.com/user-attachments/assets/a26e291b-9280-4b55-9275-f3201a36ac51">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738
Approved by: https://github.com/ezyang
ghstack dependencies: #139703
2024-11-06 00:23:40 +00:00
6a30c14a0a [Traceable FSDP2] Run any unexecuted post_backward at beginning of pre_backward hook (#139671)
Assuming the forward pass user code looks like:
```
for _ in range(2):
    x = layer(x)
```
and we have `fully_shard(layer)`, then:
- the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" (currently same for both eager and compile)
- the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" in eager, but currently it's "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile

The behavior in the backward pass is different between eager and compile, which is not ideal.

I am currently looking for a way to fix this non-ideal behavior of compile and have tried a few things:
1. Tracing the RegisterPostBackwardFunction custom autograd function - this stills seems to be a no-go, due to HOP not supporting side-effects.
2. Instead of custom autograd function, do a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to have bad interaction with register_hook of pre_backward, in the sense that it's unclear which of them will be triggered first in practice.
3. Force execute any pending post_backward before unshard in pre_backward hook, and rely on compiler to move the reshard to the right place to optimize peak memory. -> This PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671
Approved by: https://github.com/awgu
2024-11-06 00:19:06 +00:00
e7cf7d00be Support torch.bool in torch.sort + CUDA (#139409)
Summary: This restriction might be outdated, so I'm adding it back to see if we pass all the tests. I'm pretty sure cuda12 is ok.
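
A hedged usage sketch of what this enables:
```python
# Illustrative only: sorting a bool tensor on CUDA.
import torch

x = torch.tensor([True, False, True, False], device="cuda")
values, indices = torch.sort(x)
print(values)  # False values first, then True
```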

Test Plan: CI

Differential Revision: D65282650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409
Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy
2024-11-06 00:02:54 +00:00
06f619d999 typing ir.py - part 2 (#131846)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846
Approved by: https://github.com/eellison
ghstack dependencies: #139238
2024-11-06 00:01:15 +00:00
c2109ec479 typing ir.py - Disallow untyped defs for ir.py (#139238)
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-06 00:01:15 +00:00
82e4de4994 [Inductor][CPU] Enable the oneDNN Linear fusion for special case (#139172)
**Summary**
In the case of LLaMA2, a linear operation with an activation size of `(4, 1, 4096)` and a stride of `(4096, 128, 1)` is decomposed into `matmul`, and the decomposition of `matmul` results in `bmm` due to a strict contiguity check. We can align the contiguity check with ATen by skipping dims of size 1, enabling decomposition into `mm` instead.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_input_non_contiguous_3D_wo_bias
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139172
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 23:49:53 +00:00
d1c26b0781 Improvements for associative_scan - slicing of xs (#138858)
In this PR, the combine_fn is consistently called with a slice along the scan dim. It implements part of https://github.com/pytorch/pytorch/pull/136966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138858
Approved by: https://github.com/ydwu4
2024-11-05 23:38:21 +00:00
eec153a69c [BE][Attention] Factor out common code (#139788)
- Compute attention mask before the switch
- Introduce `query_device_type` variable
- Refactor some of MPS-math checks into easily readable boolean names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139788
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-11-05 23:27:18 +00:00
faab564bda [doc] Fix grammar in export.ir_spec.rst (#139584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139584
Approved by: https://github.com/zou3519
2024-11-05 23:26:36 +00:00
86d7d39bff Forward fix D65441551 for T206731737 (#139767)
Test Plan: -

Differential Revision: D65482429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139767
Approved by: https://github.com/awgu
2024-11-05 23:19:08 +00:00
c0d642a295 [pgnccl][simple] log started work numel (#139773)
Summary:
We saw some cases where the same work was started on multiple ranks but
did not complete. This info could give us more insight, e.g., whether the numel matches
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139773
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-11-05 23:11:19 +00:00
1d28b8b6d5 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit e84d1121ad66a453c8c24fcc098625e2e9764fca.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))
2024-11-05 23:10:38 +00:00
f63ee13f2c [Test][DTensor] Skip test_dtensor_mm if ROCm (#139719)
Seems there are some numeric issues when running on ROCm.
```
PYTORCH_TEST_WITH_ROCM=1 python test/distributed/_tensor/test_matrix_ops.py DistMatrixOpsTest.test_dtensor_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139719
Approved by: https://github.com/XilunWu
2024-11-05 22:56:35 +00:00
16da289402 [Workspace Inductor] Fix dynamic shapes (#139777)
# Summary
Arg ordering was wrong when dynamic shapes are enabled and we pass in the additional size args

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139777
Approved by: https://github.com/eellison
ghstack dependencies: #139157
2024-11-05 22:34:09 +00:00
d26dcda35e [test] Fix Triton test to use the correct divisibility attr (#139772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139772
Approved by: https://github.com/bertmaher
2024-11-05 22:28:18 +00:00
b09eb6ed6a [dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)
This is a bug on the main exposed by https://github.com/pytorch/pytorch/issues/139476

We have dict tag optimization where if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because generally users dont change their
parameters.

If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.

So, I am adding a flag which keeps the current state but allows users
to remove this optimization. Not ideal, but given how critical guard eval perf is,
we are in the gray area of the unsoundness vs. performance tradeoff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
2024-11-05 21:48:07 +00:00
75eeefbfab [pp] pipelining + dcp unit test (#139633)
Currently there aren't any unit tests for PP and DCP; this unit test could be useful for quick experimentation in issues like https://github.com/pytorch/torchtitan/issues/474.

`python test/distributed/_composable/test_composability/test_pp_composability.py -k test_pp_and_dcp`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139633
Approved by: https://github.com/wconstab
2024-11-05 21:02:11 +00:00
1a70185309 Add Autograd Fallback for MTIA (#139211)
Summary: As title.

Test Plan: OSS and internal CIs.

Differential Revision: D65022481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139211
Approved by: https://github.com/jvandebon
2024-11-05 20:58:21 +00:00
59b66944d4 Migrate inductor-perf-test-nightly.yml to use linux.aws.a100 (#139657)
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-05 21:24:28 +01:00
6734cb7bf2 [hop free symbols] refactor tensor.to_list implementation to call wrap_fx_proxy. (#139663)
Refactoring only. Previously, we manually called SymNodeVariable.create; now we handle it with wrap_fx_proxy. This unifies the handling of operations that produce symints in wrap_fx_proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139663
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737, #138559
2024-11-05 20:19:09 +00:00
ae86939425 [aarch64] add CUDA 12.6 to docker for sbsa wheel (#138562)
Add cuda 12.6 installation for sbsa docker
Related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138562
Approved by: https://github.com/atalman
2024-11-05 20:15:51 +00:00
d549ddfb14 [fr][rfc] use a logger to control output for flight recorder analyzer (#139656)
Summary: Use a logger to control output to the console. This is useful for hiding debug/detail messages from the console vs. showing everything together.

Test Plan:
Ran `torchfrtrace` with various switches.

The `-v` verbose switch
```
torchfrtrace --prefix "trace_" /tmp/ -v
loaded 2 files in 0.2567298412322998s
built groups, memberships
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
appending a non-matching collective
built collectives, nccl_calls
Groups
                  id  desc          size
--------------------  ----------  ------
09000494312501845833  default_pg       2
Memberships
            group_id    global_rank
--------------------  -------------
09000494312501845833              0
09000494312501845833              1
Collectives
  id    group_id
----  ----------
   0           0
   1           0
NCCLCalls
  id    collective_id    group_id    global_rank    traceback_id  collective_type    sizes
----  ---------------  ----------  -------------  --------------  -----------------  --------
   0                0           0              0               0  nccl:all_reduce    [[3, 4]]
   1                0           0              1               0  nccl:all_reduce    [[3, 4]]
   2                1           0              0               0  nccl:all_reduce    [[3, 4]]
   3                1           0              1               0  nccl:all_reduce    [[3, 4]]
   4                            0              0               0  nccl:all_reduce    [[4, 5]]
```

Without the verbose switch
```
❯ torchfrtrace --prefix "trace_" /tmp/
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

With the `-j` switch:
```
❯ torchfrtrace --prefix "trace_" /tmp/ -j
Rank 0                                             Rank 1
-------------------------------------------------  -------------------------------------------------
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[4, 5]], state=scheduled)
```

Differential Revision: D65438520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139656
Approved by: https://github.com/fduwjj
2024-11-05 20:14:18 +00:00
b9f0563aaf Add repro instructions to fx_graph_runnable.py (#139481)
This PR adds some instructions for how to add a TARGETS file to run the
fx_graph_runnable script. I'm planning to add some followups that will
add additional imports for custom ops and use autodeps to get the
dependencies, but I figure this PR is an easy first step.

Test Plan:
- pytest test/dynamo/test_structured_trace.py
- Does anyone have suggestions for how to test this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139481
Approved by: https://github.com/eellison
2024-11-05 19:24:16 +00:00
01bcf37123 [dynamo][NFC] Remove some dead code paths (#139674)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139674
Approved by: https://github.com/Skylion007, https://github.com/anijain2305, https://github.com/mlazos
2024-11-05 19:12:17 +00:00
2b3a227b35 [dynamo] Add is_mutable() and is_immutable() methods to VariableTracker (#139341)
This patch adds 2 simple methods `VariableTracker.is_mutable()` and
`VariableTracker.is_immutable()`, which helps clarify intention. For
instance, rather than writing
```python
if var.mutation_type:
    ...
```
After this patch one can write
```python
if var.is_mutable():
    ...
```

This patch also simplifies `mutation_type` propagation in some
`ListVariable` methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139341
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #139339, #139340
2024-11-05 19:11:41 +00:00
0ba3962b80 [dynamo][NFC] Move MutationType classes into variables/base.py (#139340)
As title, this addresses
https://github.com/pytorch/pytorch/pull/137905/files#r1806800222.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139340
Approved by: https://github.com/anijain2305
ghstack dependencies: #139339
2024-11-05 19:11:41 +00:00
693a0a1bd4 [dynamo][NFC] Rename mutable_local and add documentation (#139339)
This patch addresses the renaming part of #133027, specifically, it
renames the following and adds documentation for relevant classes.
1. `VariableTracker.mutable_local` to `mutation_type`
2. `MutableLocal` to `ValueMutationNew`
3. `MutableSideEffects` to `ValueMutationExisting`
4. `MutableLocalSource` to `SourceType`
5. `MutableLocalSource.Local` to `New`

Note that (2), (3) and (5) are mainly to bring consistency between them
and `AttributeMutationNew`, `AttributeMutationExisting`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-11-05 19:11:41 +00:00
5f2ed505eb [PGNCCL] Watchdog prints call-time traceback when reporting timeout (#139659)
### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with user's program or library stack.

### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.

The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito). In `ProcessGroupNCCL`, we only track / report the Python part, since that fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-11-05 19:07:17 +00:00
ee42a99745 [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-05 18:47:24 +00:00
87059d4547 [AOTAutograd] Handle edge cases for donated buffer & enable in oss (#139669)
This PR enables donated buffer in OSS and handles two edge cases:

1. While donated buffer relies on storage to check aliasing, sparse tensor subclasses do not provide access to storage, so we skip sparse tensor subclasses for donated buffer.
2. Handles missing "val" from n.meta. This is observed from `inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu`,
`functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_simple_with_none_and_nontensor`, and
`inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139669
Approved by: https://github.com/bdhirsh
2024-11-05 18:38:20 +00:00
27ec3921bc Optimize mutable torch.library.custom_op overhead (#139513)
We don't need to loop over all the args and kwargs in the
ADInplaceOrView key; we just need to bump the version on the args and
kwargs that are mutable.

On the benchmark mentioned in
https://github.com/pytorch/pytorch/issues/139494
this made the time go from
```
mutate2 = 61.72943878173828
no_mutate2 = 36.89440155029297
mutate = 236.3092498779297
no_mutate = 59.31964874267578

```
to
```
mutate2 = 47.976478576660156
no_mutate2 = 38.37468719482422
mutate = 71.21315002441406
no_mutate = 59.7432975769043
```

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139513
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139509
2024-11-05 18:30:53 +00:00
9dc5851f5d handle more devices in method_type method of TensorVariable (#138078)
Fixes #138077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138078
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 18:19:52 +00:00
de509abe1c [export] Dedup data-dependent errors based on stacktrace (#139540)
Summary:
Dedup the data-dependent errors based on the stacktrace each one points to. Right now we just display every propagate-real-tensor log that shows up, but we can actually dedup them if they are due to the same piece of code (e.g. there could be multiple calls to a piece of code that does some data-dependent computation).
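
A rough sketch of the dedup idea (helper and attribute names here are assumptions, not the actual export code):
```python
# Keep only the first propagate-real-tensor error per unique user-code stacktrace.
seen_stacks = set()

def maybe_log_data_dependent_error(error, log_fn):
    key = tuple(error.stack_summary)  # error.stack_summary is an assumed attribute
    if key not in seen_stacks:
        seen_stacks.add(key)
        log_fn(error)
```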

This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data dependent errors, but after deduping based on the stacktrace we now only get 4 errors.

Test Plan: CI

Differential Revision: D65374254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540
Approved by: https://github.com/pianpwk, https://github.com/zou3519
2024-11-05 18:16:05 +00:00
cc25b6d7ba [inductor] Error on unsupported autotuner configs (#139658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139658
Approved by: https://github.com/aakhundov
2024-11-05 18:09:02 +00:00
41e4d88584 [logging][ez] Add timer logging for pickling and unpickle for object based collective (#139757)
Summary: As discussed, we want to measure the time spent during pickling and unpickling.

Test Plan: CI

Reviewed By: wz337

Differential Revision: D65462767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139757
Approved by: https://github.com/awgu, https://github.com/Skylion007, https://github.com/fegin, https://github.com/c-p-i-o
2024-11-05 17:40:27 +00:00
5860c8ebd1 Use Manylinux2_28 for wheel builds (#138732)
Fixes https://github.com/pytorch/pytorch/issues/123649
Use Manylinux 2_28 Docker builds for PyTorch Nightly builds

This moves the wheels to a Docker image that uses ``quay.io/pypa/manylinux_2_28_x86_64`` as a base rather than ``centos:7``, which is EOL on June 30, 2024.

Information:
https://github.com/pypa/manylinux#manylinux_2_28-almalinux-8-based

manylinux_2_28 (AlmaLinux 8 based)
Toolchain: GCC 13
Built wheels are also expected to be compatible with other distros using glibc 2.28 or later, including:
Debian 10+
Ubuntu 18.10+
Fedora 29+
CentOS/RHEL 8+

This migration should enable us to migrate to latest CUDNN version, and land this PR: https://github.com/pytorch/pytorch/pull/137978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138732
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-05 17:21:24 +00:00
c0d21b6581 End TritonBundle on non-cache write codepaths (#139698)
Summary:
When we bypass the cache write in Inductor, we were also forgetting to reset the bundle; this moves resetting the bundle into the post_compile step so it gets uniformly reset.

This diff also turns on the cache for internal so that we can do a code rollout.

Test Plan: updated tests

Differential Revision: D65457224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
2024-11-05 17:00:40 +00:00
4d5cc1b4ef Revert "[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)"
This reverts commit e6ff07f00e04a9b58efb86a3dd70ed7280ae8522.

Reverted https://github.com/pytorch/pytorch/pull/139560 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking internal tests. Please see D65430317 for more details ([comment](https://github.com/pytorch/pytorch/pull/139560#issuecomment-2457620720))
2024-11-05 16:22:30 +00:00
cyy
a2bc2e38f9 Use clang-tidy 17 (#139678)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139678
Approved by: https://github.com/Skylion007
2024-11-05 16:00:25 +00:00
e0156f9faa HACK: use FB proxy for testowners (#139473)
I got fed up with this always timing out when I didn't have
correct proxy settings.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139473
Approved by: https://github.com/malfet
2024-11-05 15:35:41 +00:00
13eb3b3f6f [Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702)
Summary: During dynamic rendezvous, we shouldn't use the address from the store but just use `self._this_node.addr` directly, because sometimes the store host is not the host of rank 0. Passing the wrong host will cause a timeout error. This is a follow-up fix to S463164; for internal tests, we disable the TCPStore sharing for now.

Test Plan: CI.

Differential Revision: D65453312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
2024-11-05 15:28:03 +00:00
53f164cae5 [CUDA][CI][cusparselt] Only CUDA 11.8 ships the libcusparseLt.so.0, CUDA 12 would use PYPI libcusparselt (#138547)
since nvidia-cusparselt-cu12 is available on PyPI while nvidia-cusparselt-cu11 is not.

Related: #138175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138547
Approved by: https://github.com/atalman
2024-11-05 15:12:41 +00:00
349cd49406 Fix compiler collective TORCH_TRACE and improve code state printing (#139716)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139716
Approved by: https://github.com/yf225
2024-11-05 14:32:52 +00:00
cyy
546318e559 [7/N] Don't skip ASAN on some tests (#139675)
Follows #139565
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139675
Approved by: https://github.com/ezyang
2024-11-05 14:01:01 +00:00
f551d90552 Fix for gcc10 torch.compile compiler error when march=aarch64+sve (#137795)
Disable tree vectorization in vec_convert.h for gcc10 with aarch64+sve, which otherwise causes a compiler error.

```
/tmp/tmpuqk7lj9j/zx/czx2eyturb6j6m727xhvknkjbdu3y5nqqk66wgxcjkwnxuzvpm5r.cpp:3:18: internal compiler error: in vect_get_vector_types_for_stmt, at tree-vect-stmts.c:12252
    3 | extern "C"  void kernel(const float* in_ptr0,
```
Fixes #137775

I've not linked a gcc bug report yet as they require a minimal reproducer to be made.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137795
Approved by: https://github.com/malfet
2024-11-05 12:46:42 +00:00
e84d1121ad Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------
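
For reference, a hedged migration sketch (assuming the public `torch.compiler.is_compiling()` API is the intended replacement for the deprecated private helpers):
```python
import torch

def scale(x):
    # Previously: torch._utils.is_compiling() or torch._dynamo.external_utils.is_compiling()
    if torch.compiler.is_compiling():  # assumed public replacement
        return x * 2  # branch taken while torch.compile / export is tracing
    return x.mul(2)
```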

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-05 10:44:56 +00:00
ffb7a08921 Fix torch.histc not checking min > max on cuda for int8 tensors (#139372)
Fixes #139360

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)

Assigning `min` and `max` to the low-precision `input_t` variables `minvalue` and `maxvalue` causes a wrong comparison result in the following check:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)

![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487)

Change the type of `minvalue` and `maxvalue` to fix it, similar to this line:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)

**Test Result**
```bash
$ pytest test/test_reductions.py -vv
```
![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372
Approved by: https://github.com/eqy
2024-11-05 08:42:38 +00:00
356fc41ae0 [Intel GPU] Avoid target_link_libraries twice for torch_xpu_ops which will potentially cause multiple definition symbol linker error. (#139024)
[Intel GPU] Avoid target_link_libraries twice for torch_xpu_ops which will potentially cause multiple definition symbol linker error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139024
Approved by: https://github.com/EikanWang, https://github.com/fengyuan14, https://github.com/jansel
2024-11-05 08:18:09 +00:00
6ad52db8c8 use torch.sym_sum instead of incremental sum in _cat_meta (#139653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139653
Approved by: https://github.com/ezyang
2024-11-05 07:24:24 +00:00
51a3d6dbc3 Fix existing lint issues in ir.py (#139237)
- Remove stale mypy "type: ignores"
- Made ir.py pass the rest of the lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139237
Approved by: https://github.com/Skylion007
2024-11-05 06:06:12 +00:00
b2f5a5311b RMSNorms docs - remove biases initialization (#139620)
RMSNorm doesn't use a bias in `elementwise_affine`, so I've removed it from the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139620
Approved by: https://github.com/mikaylagawarecki
2024-11-05 05:59:41 +00:00
9aaf3a04fa [profiler][UT] instantiate profiler UTs for devices and enable UTs for xpu profiler (#134316)
This PR enables the profiler-related UTs to be device-agnostic. It instantiates the profiler UTs for different device types and enables them on the XPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134316
Approved by: https://github.com/etaf, https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-11-05 05:46:13 +00:00
de4216bfda increase add_loop benchmark and refresh all results! (#139703)
See the comments at the end of https://github.com/pytorch/pytorch/pull/138756.
I am also refreshing all values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139703
Approved by: https://github.com/bobrenjc93
2024-11-05 05:41:21 +00:00
9e14d86573 [Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255)
`kernel_micro_gemm` generated using BRGEMM:
```
template <bool accum>
inline void kernel_micro_gemm(
    const half* __restrict__ A,
    const half* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    at::native::cpublas::brgemm(
      M, N, K,
      lda, ldb, ldc,
      1.f, accum ? 1.f : 0.f,
      A,
      B,
      C);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 05:33:29 +00:00
c8a55eea88 [DCP] Fix process_group logging for DCP methods (#139428)
Summary:
Currently, we incorrectly log process_group for DCP based events.

We rely on [c10d_logger.py](https://fburl.com/v4mdme9z) to fill in information about process_group (e.g. backend, nccl_version if available).

In [checkpoint/logger.py](https://fburl.com/yho9nqbu) we pass the `msg_dict` to c10d_logger which never contains the `process_group` param, so [c10d_logger](https://fburl.com/zlw2ukxp) logs information about the default process_group which is always `NCCL`.

Test Plan:
Before:

Always defaults to NCCL even though GLOO is passed by caller.

{F1950847585}

After:

GLOO backend shows up.

{F1950848375}

Differential Revision: D65255871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139428
Approved by: https://github.com/teja-rao, https://github.com/mhorowitz
2024-11-05 05:24:38 +00:00
fe4fa1df9f [dynamo][eval_frame] Set the callback to None earlier for guard eval (#139655)
xref - https://fb.workplace.com/groups/1075192433118967/permalink/1536570810314458/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139655
Approved by: https://github.com/jansel, https://github.com/williamwen42
2024-11-05 05:18:46 +00:00
fdfd4c50ba Assign owners to periodic and slow jobs (#139519)
As an outcome of https://fburl.com/gdoc/voce5o06, I want to assign owner(s) to any periodic or slow jobs that are still needed but can't run more frequently (too $$$, capacity constraints, don't fail that often). They include:

* multigpu
* debug build
* ROCm (distributed, slow)

@malfet @soulitzer I put down your names as the owners of debug build and slowgradcheck respectively.  Please let me know if you are ok with that, or if you have a better option in mind.

Any jobs there without an owner are owned by us (PT Dev Infra)

### Testing

The owners show up in the job name: https://hud.pytorch.org/pr/139519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139519
Approved by: https://github.com/malfet
2024-11-05 04:48:12 +00:00
a766d84a3c Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
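
A rough sketch of the relaxed condition (helper and attribute names are assumptions, not Inductor's actual scheduler API):
```python
def can_inplace_into(buf, consumer, ancestors, removed, completed):
    """True if every user of `buf` other than `consumer` is "inconsequential"."""
    def inconsequential(user):
        return user in removed or user in completed or user in ancestors

    return all(inconsequential(u) for u in buf.users if u is not consumer)
```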

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm; makes sure there's an inplaced buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-11-05 03:44:09 +00:00
1e9390a30a Add setuptools and wheel to cp312, cp313 and cp313t for Manylinux2_28 builds (#139636)
Install setuptools and wheel dependencies for cp312, cp313, cp313t on Manylinux 2_28 images.
This should resolve
```
ModuleNotFoundError: No module named 'setuptools'
```
On PR: https://github.com/pytorch/pytorch/pull/138732

This issue was addressed on XPU images already. We should apply the same fix for the rest of the images instead of keeping it XPU specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139636
Approved by: https://github.com/huydhn, https://github.com/chuanqi129
2024-11-05 03:25:35 +00:00
9039fbb47e [FSDP2] Make module-to-state mapping use weakrefs (#139650)
Without this, `del model` does not free memory of a module with FSDP2 applied.
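
A conceptual sketch of the change (assumed structure; the actual FSDP2 internals may differ):
```python
import weakref

# Before: a plain dict kept a strong reference, so the module could never be collected.
# _module_to_state = {}
# After: weak keys let the entry (and the module's memory) go away once the module is deleted.
_module_to_state = weakref.WeakKeyDictionary()
```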

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139650
Approved by: https://github.com/yf225
2024-11-05 02:16:52 +00:00
cyy
5008d15ae9 [2/N] Remove usage of C array (#139589)
Follows  #139567
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139589
Approved by: https://github.com/ezyang
2024-11-05 01:58:12 +00:00
c92de3b5df Add BRGEMM API versioning to be compatible with different oneDNN versions (#138184)
oneDNN v3.6 updated the ukernel APIs of `brgemm` and `brgemm_pack_B`. Considering the upgrade of oneDNN, ukernel API versioning is needed to stay compatible with different oneDNN versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138184
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-11-05 01:26:27 +00:00
299dbcde61 [CI] Fix xpu ci test with s3 cache (#139604)
Fix a regression caused by https://github.com/pytorch/pytorch/pull/121323
Works for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139604
Approved by: https://github.com/atalman, https://github.com/malfet
2024-11-05 01:23:21 +00:00
eaf92b2484 [Python 3.13 CD] Enable Aarch64 py3.13 builds (#138629)
Adding CD aarch64. Part of: https://github.com/pytorch/pytorch/issues/130249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138629
Approved by: https://github.com/ZainRizvi
2024-11-05 01:16:37 +00:00
967cef294b [inductor][triton 3.2] fix test_codegen_config_option_dont_assume_alignment for triton 3.2 (#139640)
"divisible_by_16" was renamed "divisibility_16". Found in #139206.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139640
Approved by: https://github.com/aakhundov
2024-11-05 01:13:54 +00:00
3672c688e3 Fix layout for SetSourceTensorKernel (#137973)
Fixes #136837.
`aten.set_.source_Tensor` will make the size and stride of the first input and output follow that of the second input: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorShape.cpp#L440. If the layouts of the two inputs are different, the following `assert_size_stride` will fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137973
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 00:55:17 +00:00
639162f39a Add cache size to pt2_compile_events (#139627)
Summary:
I realized I wanted to check "are my cache entries/IO unreasonably large"
and there's no easy way to do it.  This lets me do it.

Test Plan: servicelab

Differential Revision: D65390363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139627
Approved by: https://github.com/c00w
2024-11-05 00:30:10 +00:00
0058f71002 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353, #139358
2024-11-05 00:29:58 +00:00
b82a51bc6b [BE] And delete DeprecatedTypProperties cast (#139358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139358
Approved by: https://github.com/ezyang
ghstack dependencies: #139353
2024-11-05 00:23:12 +00:00
1b6f0b2a00 Revert "[BE] And delete DeprecatedTypProperties cast (#139358)"
This reverts commit 92a2a9ded22ef20a49e8c31dc2add93b40e8a78c.

Reverted https://github.com/pytorch/pytorch/pull/139358 on behalf of https://github.com/ZainRizvi due to Change reverted internally due to broken builds. See D65378845 ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2455959040))
2024-11-05 00:13:48 +00:00
4a3ee96427 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 9d096e4d9ffc2b57a19cbefd5d4b5cce7306945b.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/ZainRizvi due to Change reverted internally due to broken builds. See D65378845 ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2455959040))
2024-11-05 00:13:48 +00:00
cyy
64d9ee88d7 [11/N] Fix extra warnings brought by clang-tidy-17 (#139599)
Follows #139385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139599
Approved by: https://github.com/sraikund16
2024-11-04 23:57:41 +00:00
3f248a5735 Classify miss-inplaced tensors in logs. (#139240)
Summary:
Use signpost logs.
A follow-up is to remove the field possibly_missed_reinplacing_opportunities from the dynamo compile table.

Differential Revision: D65180194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139240
Approved by: https://github.com/zou3519
2024-11-04 23:56:14 +00:00
e947649e8f [BE] Change _marked_safe_globals_list to set (#139303)
Prevent same global from being added multiple times

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139303
Approved by: https://github.com/janeyx99
ghstack dependencies: #138936, #139221, #139433, #139541, #137602
2024-11-04 23:50:55 +00:00
1565eba4b4 [cuDNN][SDPA] Match query's memory layout ordering for output in cuDNN SDPA (#138354)
For #138340

~~We might consider more sophisticated logic here but the corresponding logic in other backends doesn't seem to do anything fancy for non BSHD/BHSD cases ea8ea2f33f/aten/src/ATen/native/transformers/cuda/attention.cu (L1145~~)

Ended up going with a more general approach that handles more or less arbitrary layouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138354
Approved by: https://github.com/drisspg
2024-11-04 23:49:09 +00:00
a678eaf1ad check fake/real mismatches during real tensor prop (#137747)
Summary:
While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like: `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these turned out to be due to the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output).

Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](929797dedb/torch/_subclasses/fake_utils.py (L78))

Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer?

Test Plan: test_export, test_fake_tensor

Differential Revision: D64210055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747
Approved by: https://github.com/zou3519
2024-11-04 23:39:48 +00:00
9919932783 Specialize symfloats that flow through is_integer (#139572)
Fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_is_integer_num_type6_dynamic_shapes` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139572
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568
2024-11-04 23:35:35 +00:00
350bc2a166 [export] Add support for symbool to make it usable for torch.cond (#138765)
# Why?

I want the following code to work.

minimal repro:
```
class M(torch.nn.Module):
    def forward(self, dilate_flag):
        return dilate_flag.item()

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()

ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```

error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program

second error will be handled by https://github.com/pytorch/pytorch/pull/138760

# Motivation

I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.

```
class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dilate_flag = 0

    def forward(self, dilate_flag):
        self.dilate_flag = dilate_flag.item()

        def true_fn(dilate_flag):
            return dilate_flag.clone()

        def false_fn(dilate_flag):
            return dilate_flag.clone()

        torch.cond(
            self.dilate_flag,
            true_fn,
            false_fn,
            (dilate_flag,),
        )
        return self.dilate_flag

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()

for input in inputs:
    expected_output = model(*input)

    ep = torch.export.export(model, input, strict=False)
    path = torch._inductor.aot_compile(ep.module(), input)
    aot_model = torch._export.aot_load(path, device="cuda")
    actual_output = aot_model(*input)

    assert (
        expected_output == actual_output
    ), f"henry they are not equal {expected_output} != {actual_output}"
```

Differential Revision: D64867504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
2024-11-04 23:31:49 +00:00
6add86a29f Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit bf5cd8d0116d90d24b8acb38d578b8952dab22ef.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking lint on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11673543178/job/32504499599) [HUD commit link](bf5cd8d011) ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2455908056))
2024-11-04 23:30:15 +00:00
23169a6bcc Disable foreach tests for complex128 internally (#139649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139649
Approved by: https://github.com/ngimel
2024-11-04 23:24:47 +00:00
87a379b61b Move pippy to training IR (#139233)
Differential Revision: [D65282662](https://our.internmc.facebook.com/intern/diff/D65282662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139233
Approved by: https://github.com/kwen2501
ghstack dependencies: #138658, #139209
2024-11-04 23:07:14 +00:00
397938b453 [hop free symbols][refactor] lift freevar to parent graph before lifting to subgraph (#138559)
This refactoring is for getting a deterministic ordering when binding tensors and the sizes of tensors. When seeing a free tensor x with shape (s0,) in a subgraph, the ordering of lifting changes from
```
lift_x_in_child, lift_s0_in_child, lift_s0_in_parent, lift_x_in_parent
```
to
```
lift_x_in_parent, lift_s0_in_parent, lift_x_in_child, lift_s0_in_child
```
This produces a deterministic ordering for handling the symints in lifted tensors.

This is also the current contract of dynamo top-level graph: we lift free_symbols in sizes after tensor x and insert the free symbols before the tensor x's proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138559
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737
2024-11-04 22:48:14 +00:00
c5b79699e1 [hop free symbols] replace ctx.save_for_backward to support symints/ints (#138737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138737
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Chillee
ghstack dependencies: #138345, #138428, #138558
2024-11-04 22:48:14 +00:00
ac20d0f893 [hop free symbols][refactor] make map's save_for_backward to handle int (#138558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138558
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428
2024-11-04 22:48:07 +00:00
dc3a6a9d08 [hop free symbols][refactor] make create_graph_input always take example_value (#138428)
Code refactoring only. We move the wrap_to_fake_tensor_logic out of wrap_fx_proxy for placeholders to provide the invariant that **all graph inputs must set their example values when creating the inputs**. This invariant helps us to identify all the free symbols in the graph in top-level and sub-graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138428
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #138345
2024-11-04 22:47:49 +00:00
54c69a785b [hop free symbols][refactor] make bound_symbols a dictionary (#138345)
Code refactoring only. Change all self.tx.output.bound_symbols to self.tx.output.root_tracer.bound_symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138345
Approved by: https://github.com/zou3519
2024-11-04 22:47:41 +00:00
514c466cd9 Redirect the custom ops landing page :D (#139634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139634
Approved by: https://github.com/zou3519
2024-11-04 22:25:15 +00:00
bf5cd8d011 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-04 22:10:04 +00:00
080e0ca584 [aoti tests] enable some aoti package tests for fbcode (#139359)
Differential Revision: D65249372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139359
Approved by: https://github.com/angelayi
2024-11-04 22:06:07 +00:00
3d93caf664 [c10d] Add thread-safety initialization warning (#139638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139638
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-11-04 21:38:47 +00:00
cyy
7deec3942f [6/N] Don't skip ASAN on some tests (#139565)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139565
Approved by: https://github.com/ezyang
2024-11-04 21:32:44 +00:00
91d38a5a82 Fix cuda Manylinux 2_28 docker images PATH setting (#139631)
Enabling Manywheel builds here: https://github.com/pytorch/pytorch/pull/138732

During the build I observe the failure with cuda jobs:

```
-- Compiler does not support SVE extension. Will not build perfkernels.
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is unknown
CMake Error at cmake/public/cuda.cmake:47 (enable_language):
  No CMAKE_CUDA_COMPILER could be found.

  Tell CMake where to find the compiler by setting either the environment
  variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
  path to the compiler, or to the compiler name if it is in the PATH.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:44 (include)
  CMakeLists.txt:851 (include)
```

While the correct sequence is supposed to be:
```
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89")
```

The issue was found to be a missing PATH setting in the 2_28 Dockerfile. This section exists in the CentOS Dockerfile here:
https://github.com/pytorch/pytorch/blob/main/.ci/docker/manywheel/Dockerfile#L174-L175

(Please note these Docker images are not used yet. https://github.com/pytorch/pytorch/pull/138732 should enable using these images.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139631
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-11-04 21:13:17 +00:00
888110841c [inductor] don't fuse two nodes if likely increase peak memory (#138756)
Partially fixing https://github.com/pytorch/pytorch/issues/138685

Add a (relatively safe?) heuristic to skip fusion if it could potentially increase peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two nodes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuse these 2 nodes
        2. find the intersection of these buffers for the two nodes and sum the total buffer size.
           If we don't fuse these two nodes, we can at least avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the savings from fusion cannot offset the extra memory allocation.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756
Approved by: https://github.com/jansel
ghstack dependencies: #139136
2024-11-04 20:49:29 +00:00
1aa71be56c [PT2] Decouple decompose_triton_kernel_wrapper_functional from decompose_auto_functionalized (#139526)
As title. We may not always want to remove `triton_kernel_wrapper_functional`; for example, see the references in [`unsafe_remove_auto_functionalized_pass`](c8ab9b06a2/torch/export/_remove_auto_functionalized_pass.py (L48)).

Test Plan: CI & [D62592946](https://www.internalfb.com/diff/D62592946)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139526
Approved by: https://github.com/zou3519
2024-11-04 20:16:18 +00:00
71dc5df93c [pipelining] Fix 'last backward' counting for dI / dW (#139415)
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing to num microbatches to determine last backward.
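
A minimal sketch of the corrected check (attribute names are assumptions, not the actual pipelining schedule code):
```python
def is_last_backward(full_backward_count: int, backward_weight_count: int, num_microbatches: int) -> bool:
    # A stage may run full backwards (dI + dW) and split weight-only backwards (dW) in any mix,
    # so the "last backward" is reached when their sum covers all microbatches.
    return (full_backward_count + backward_weight_count) == num_microbatches
```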

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
2024-11-04 20:14:10 +00:00
99413cd1a8 [CMake] Fix local MPS builds (#139651)
Not sure how it works on some machines, but clean build fails for me after https://github.com/pytorch/pytorch/pull/138636 was landed, even though it works fine on another machine.

Solution is to create an empty file when one adds a dependency, but later this dependency will be updated by the build rule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139651
Approved by: https://github.com/atalman
2024-11-04 19:43:53 +00:00
30a83ca991 [dynamo] Improve codegen for DataPtrVariable and fix tensor reference issue (#139487)
This addresses
https://github.com/pytorch/pytorch/pull/137677/files#r1799836499, which
had to set `allow_cache=False` for codegen on `DataPtrVariable.base`,
which is a `TensorVariable`, otherwise we observe failure of
`test_no_grad_copy` when testing with Dynamo.

I've seen `test_no_grad_copy` failing a few times, and every single time
it's related to cyclic references; my best guess is that the cyclic reference
holds some tensor object in memory longer than necessary, preventing the
optimization introduced in #11165.

This patch makes `OutputGraph.cleanup()` more aggressive by clearing out
all fields that might reference a `VariableTracker`. As a result, we can
remove the aforementioned `allow_cache=False`, which helps generate
better code (e.g., in the case of `test_no_grad_copy`, it skipped generating
a redundant graph whose only op is returning the input tensor; instead we just
generate a single `LOAD_FAST`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139487
Approved by: https://github.com/jansel, https://github.com/aakhundov
2024-11-04 19:14:06 +00:00
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
e76ce20177 Log to pt2 compile events (#139601)
Summary: This option was added after I wrote the original diff; let's publish to pt2_compile_events.

Test Plan: CI

Differential Revision: D65404910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139601
Approved by: https://github.com/jamesjwu
2024-11-04 18:39:06 +00:00
4930c4b716 [inductor] patterns to remove pointless view/permute pairs (#139136)
These are not artificial patterns I came up with; they show up in the linear+CrossEntropyLoss graph.

Consider this snippet:
```
        class LinearAndCEL(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(C, V)
                self.ce = nn.CrossEntropyLoss()

            def forward(self, x, y):
                return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```

`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` first views x as a [BxT, C] tensor, does the matmul to produce a [BxT, V] tensor, and then views this output back to a 3D tensor of shape [B, T, V]. The user code then adds another view op to convert the tensor shape to [BxT, V]. This generates a pair of redundant views. A pair of redundant permutes happens in the backward pass when we compute gradients.

The view ops make it hard to chunk linear+CEL. When a view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's in general a nice thing to do.
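
To make the redundancy concrete, here is a simplified, runnable illustration (not the actual pattern-matcher code; the dimension names B, T, C, V are inferred as batch, sequence length, input features, and number of classes):
```python
import torch

B, T, C, V = 2, 3, 4, 5
x = torch.randn(B, T, C)
w = torch.randn(V, C)

y3d = (x.view(B * T, C) @ w.t()).view(B, T, V)  # linear restores the 3D shape
logits = y3d.view(B * T, V)                     # user code flattens it right back
# The view to [B, T, V] followed by the view to [B*T, V] is a pointless pair that the
# new Inductor patterns remove; an analogous permute pair shows up in the backward graph.
```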

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-11-04 18:39:02 +00:00
ca43ecd599 Flip default on weights_only (#137602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137602
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #138936, #139221, #139433, #139541
2024-11-04 18:30:29 +00:00
f55dfbcf87 Remove hasattr(__slots__) for BUILD logic in weights_only unpickler (#139541)
This is tested in PR stacked above in

```python
python test/distributed/fsdp/test_fsdp_state_dict.py TestFSDPStateDict.test_torch_save_load
```

We cannot depend on `hasattr(..., __slots__)` to know whether a BUILD instruction has slotstate. For example, if a class subclasses ABC, `hasattr(__slots__)` will be `True` but there might be no slots (and hence `state` will not be a tuple). So revert #138936 to follow the pickle library's code.

```python

>>> from abc import ABC
>>> hasattr(ABC, "__slots__")
True
```

So

```python
import torch
from abc import ABC
from dataclasses import dataclass

class Foo(ABC):
    pass

class FooWrapper(Foo):
    def __init__(self, x, y):
        self.x = x
        self.y = y

f = FooWrapper(1, 2)
torch.save(f, "temp.pt")
with torch.serialization.safe_globals([FooWrapper]):
    torch.load("temp.pt")
```

Would fail on the previous code with
```
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1934, in _load
    result = unpickler.load()
  File "/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py", line 366, in load
    for k, v in slotstate.items():
```

As there is actually no slotstate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139541
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221, #139433
2024-11-04 18:30:29 +00:00
ae0e7042f6 Fix custom obj being input (#139209)
Differential Revision: [D65158939](https://our.internmc.facebook.com/intern/diff/D65158939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139209
Approved by: https://github.com/ydwu4
ghstack dependencies: #138658
2024-11-04 18:24:29 +00:00
85c3c4132d no-op torch.library.custom_op APIs on torch.deploy (#139509)
We forgot this case in the previous PR. Fixes
https://github.com/pytorch/pytorch/issues/137536

Test Plan:
- better tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139509
Approved by: https://github.com/williamwen42
2024-11-04 18:01:08 +00:00
6dada2136a Revert "Refactor FxGraphDrawer to use HTML-like labels (#137726)"
This reverts commit 1e738420296a84406cd0a1626074ea6447a6603a.

Reverted https://github.com/pytorch/pytorch/pull/137726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like some internal components are failing after this change and need to be updated ([comment](https://github.com/pytorch/pytorch/pull/137726#issuecomment-2455332612))
2024-11-04 17:44:44 +00:00
e080c89bdc Make test_torchbind.py training IR compatible (#138658)
In this diff, I make the test_torchbind.py tests handle the training IR. Today in the training IR, we don't see the effect token and HOP because this happens at the FunctionalTensorMode. Maybe in the future, we should move this logic up to the training IR so that writing passes etc. on the training IR is safer. But for migration purposes, I think it is ok for now. I also fixed a few bugs:
1. ep.module() doesn't register all aliased constants in the module.
2. When we retrace, we need to fakify the original Torchbind object.
3. We don't run any DCE on training IR so we need to add some more torch ops to verifier.

Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2024-11-04 17:43:11 +00:00
68c515b292 don't run z3 analysis on backed symfloat nodes (#139568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139568
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457
2024-11-04 17:04:29 +00:00
d3fc13a9dd use more elements per thread for narrow dtypes (#139449)
Fix a perf issue for narrow dtypes by accessing more elements per thread.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-04 16:43:33 +00:00
3ca794783f Revert "[SymmetricMemory] introduce a binding for cuMemset32Async (#138755)"
This reverts commit 924e726c3a2566125f55cdbff4dff054d3db3232.

Reverted https://github.com/pytorch/pytorch/pull/138755 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally.  Can you please fix this PR so it works internally and re-merge it? See D65401876 for more details ([comment](https://github.com/pytorch/pytorch/pull/138755#issuecomment-2455173596))
2024-11-04 16:34:34 +00:00
87404b6ca6 support symfloats in translation validation (#139457)
fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesHigherOrderOpTests.test_cond_pytree_operands_with_non_tensor_leaves_dynamic_shapes` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139457
Approved by: https://github.com/ezyang
ghstack dependencies: #139569
2024-11-04 15:40:08 +00:00
6b8e3022f2 Remove c10::optional usages in PyTorch (#139525)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139525
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-11-04 15:35:23 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
2ce2e4df4e Update slow tests (#139051)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139051
Approved by: https://github.com/pytorchbot
2024-11-04 11:49:06 +00:00
12d225d91c add opaque unary sin and cos to SYMPY_INTERP (#139569)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nn.py TestNNDeviceTypeCPU.test_affine_3d_rotateRandom_cpu` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139569
Approved by: https://github.com/ezyang
2024-11-04 07:37:11 +00:00
3337439dc0 [inductor] modify the heuristic for disabling vectorization (#136422)
Summary
Since we have already implemented tail loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristic for disabling vectorization from a performance perspective. The new heuristic is: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable vectorization.
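
A one-line sketch of the re-tuned check (names are assumptions, not Inductor's exact code):
```python
def should_disable_vectorization(num_elems_along_vec_dim: int, num_ops: int, tiling_factor: int) -> bool:
    # Disable vectorization only for very small loops with few operations.
    return num_elems_along_vec_dim < tiling_factor / 4 and num_ops < 10
```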

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-04 07:33:32 +00:00
f4ee5a243d Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402)
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.

In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.

In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.

The `triton_compile` event is a bit questionable: do we need a row for each triton compile event? It might eat into our already low retention, so I might just remove that. Will discuss with @slarsen.

Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
2024-11-04 06:37:18 +00:00
cyy
3179eb15ae [1/N] Remove usage of C array (#139567)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139567
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-04 04:52:46 +00:00
cadc50e7e9 LOG(INFO) -> VLOG(2) in ProcessGroupNCCL (#130696)
In the same spirit as https://github.com/pytorch/pytorch/pull/105695

Initialization and error handling logs are mostly kept. Routine logs are changed to VLOG.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130696
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@fb.com>
2024-11-04 04:43:42 +00:00
ed30fa74ab [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-04 04:28:40 +00:00
b6fb135c2c [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-04 04:28:40 +00:00
3d633f12ba [inductor] Move remove_kernel_local_buffers to Kernel (#139370)
This method mutates the kernel, so it fits better in that class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365
2024-11-04 04:28:33 +00:00
66d5e2405d [inductor] Remove Node.last_usage mutation (#139365)
I can't figure out why this is needed.  Let's see if tests fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365
Approved by: https://github.com/shunting314
ghstack dependencies: #139364
2024-11-04 04:28:25 +00:00
d189f92eb1 [inductor] Remove SIMDKernel.last_usage (#139364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-04 04:28:18 +00:00
e6ff07f00e [dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)
This is a bug on main, exposed by https://github.com/pytorch/pytorch/issues/139476

We have a dict tag optimization where, if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because generally users don't change their
parameters.

If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.

So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how serious guard eval perf has to be,
we are in the gray area of the unsoundness vs performance tradeoff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
2024-11-04 00:54:20 +00:00
cyy
7f387fa612 [10/N] Fix extra warnings brought by clang-tidy-17 (#139385)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139385
Approved by: https://github.com/Skylion007
2024-11-04 00:47:19 +00:00
3242049daa [profiler] Annotate triton kernels with kernel hash (#139531)
As above, this annotates the triton kernel hash in the profile attributes.

Added a new unit test in the profiler for triton/dynamo events.

Testplan:

Running new unit test in CI

Internal:
  buck2 run @mode/dev-nosan caffe2/test:profiler -- -r test_pt2_triton_attributes

Running on an example, this is how the kernel hash file looks
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 1670242, "tid": 1670242,
    "ts": 2413669097354.058, "dur": 95.812,
    "args": {
      "External id": 3,"kernel_hash": "cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu",
"grid": "grid(100,)", "Record function id": 0, "stream": 0, "Concrete Inputs": ["", "", "", "100"], "kernel_file": "/tmp/torchinductor_bcoutinho/qa/cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu.py", "kernel_backend": "triton", "Input type": ["float", "float", "float", "Scalar"], "Input Strides": [[10, 1], [10, 1], [10, 1], []], "Input Dims": [[10, 10], [10, 10], [10, 10], []], "Ev Idx": 2

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139531
Approved by: https://github.com/davidberard98
2024-11-03 23:19:35 +00:00
924e726c3a [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-03 21:37:31 +00:00
5d07651c72 only use hint_size in _smart_symbol_sort for size type symbols (#139571)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_exponential_kstest_cpu_bfloat16` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139571
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484, #139486
2024-11-03 21:15:08 +00:00
cyy
57a49018b1 [5/N] Fix Wextra-semi warning (#139465)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139465
Approved by: https://github.com/ezyang
2024-11-03 20:40:50 +00:00
cyy
03e83111f5 Remove unnecessary check of CUDA 10.2 (#139566)
Since PyTorch now requires higher CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139566
Approved by: https://github.com/ezyang
2024-11-03 20:04:37 +00:00
d84a344410 [Inductor] Skip coordinate_descent_tuning for mm/bmm decomposition on CPU (#139537)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/138823: `coordinate_descent_tuning` doesn't provide a benefit on CPU, and we prefer lowering `mm`/`bmm` into ATen kernels or the CPP GEMM template.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_coordinate_descent_tuning
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139537
Approved by: https://github.com/jansel
2024-11-03 10:10:29 +00:00
585dbfa583 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation is done from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt in.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-03 06:29:57 +00:00
3a2ab9584f Revert "[executorch hash update] update the pinned executorch hash (#139536)"
This reverts commit 468d592fbc12dfc67d89f954781ccbf540241470.

Reverted https://github.com/pytorch/pytorch/pull/139536 on behalf of https://github.com/huydhn due to This is breaking trunk, need to fix before relanding ([comment](https://github.com/pytorch/pytorch/pull/139536#issuecomment-2453313984))
2024-11-03 06:25:41 +00:00
a1370259ba always specialize float on export path (#139486)
This is the next step in supporting dynamic float arguments in PT2: docs.google.com/document/d/1HswUSp9H6mg8Vg27mhRk8YzC9q_uf63b6wz-gwx65BQ/edit?pli=1#heading=h.xvyiqp8tuje6. To make this more incremental and tractable, we've decided to opt the export path out of this first phase of the rollout.

Fixes python test/export/test_export.py TestExport.test_export_input_mutation_dynamic_shape when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139486
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484
2024-11-03 04:47:12 +00:00
25f243ff5d Update tensorify pass to specialize symfloats we didn't tensorify away (#139564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139564
Approved by: https://github.com/huydhn
2024-11-03 04:27:43 +00:00
b3ad45733b [Lint] Clang-format all metal kernels (#139530)
Except Quantized.metal, where linting breaks all the ASCII art
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139530
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #139522
2024-11-03 04:14:20 +00:00
468d592fbc [executorch hash update] update the pinned executorch hash (#139536)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139536
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-03 03:14:06 +00:00
067d2a089d Revert "Expose Storage _use_count API in Python (#139426)"
This reverts commit e31136d07bbfb10735df101df953c73d22dde24b.

Reverted https://github.com/pytorch/pytorch/pull/139426 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor job in trunk ([comment](https://github.com/pytorch/pytorch/pull/139426#issuecomment-2453269063))
2024-11-03 02:40:45 +00:00
b8b60e0bc5 add is_integer to support example_value function whitelist (#139484)
Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_is_integer_dynamic_shapes when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139484
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482
2024-11-03 02:01:38 +00:00
f121eab018 [c10d] Remove dead Dynamo marker (#139545)
Per discussion with @anijain2305, `dynamo_unsupported_distributed_c10d_ops` is not referenced anywhere.
Removing this dead code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139545
Approved by: https://github.com/Skylion007
2024-11-03 00:40:26 +00:00
0f06dff4d7 Restores release_lock_on_cudamalloc behavior in CUDACachingAllocator (#139430)
In https://github.com/pytorch/pytorch/pull/134685, I transformed the following code:
```CPP
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        // At scope exit, acquire the lock again. This provides safety against
        // any potential exceptions in the cudaMallocMaybeCapturing function.
        auto sg = c10::make_scope_exit([&]() { lock.lock(); });
        lock.unlock();
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      } else {
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      }
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        TORCH_CHECK(
            lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
      }
```
into:
```CPP
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        // At scope exit, acquire the lock again. This provides safety against
        // any potential exceptions in the cudaMallocMaybeCapturing function.
        auto sg = c10::make_scope_exit([&]() { lock.lock(); });
        lock.unlock();
      }
      auto active_pool = MemPoolContext::getActiveMemPool();
      if (active_pool && active_pool->allocator() &&
          p.pool->owner_PrivatePool) {
        ptr = active_pool->allocator()->raw_alloc(size);
        p.err = ptr ? cudaSuccess : cudaErrorMemoryAllocation;
      } else {
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      }
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        TORCH_CHECK(
            lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
      }
```
This is wrong because I didn't realize what `c10::make_scope_exit([&]() { lock.lock(); });` does, so my change no longer lets `release_lock_on_cudamalloc` perform unlock, then alloc, then re-lock; instead it just unlocks and immediately re-locks. This PR rectifies that change, and in addition adds an ASSERT ensuring the active pool and p.pool are the same (mirroring the behavior from released_cached_blocks).

Thanks @zvon82 for reporting this!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139430
Approved by: https://github.com/ezyang
2024-11-03 00:04:30 +00:00
a3cb8ee38b AOTAutograd: Make general SymInt hashable when merging view inputs. (#139553)
Fix: #139111

This PR wraps `SymInt` input arguments with `SymIntEqByExpr`, making them hashable when
merging view inputs (`merge_view_inputs` function).
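
A minimal stand-in sketch of the idea (illustrative only; the real `SymIntEqByExpr` implementation lives elsewhere in the codebase and may differ):

```python
import torch

class SymIntEqByExprSketch:
    """Illustration only: hash and compare a SymInt by its underlying symbolic
    expression so it can be used as a dict key when merging view inputs."""

    def __init__(self, val):
        self._val = val

    def _key(self):
        if isinstance(self._val, torch.SymInt):
            return str(self._val.node.expr)
        return self._val

    def __eq__(self, other):
        return isinstance(other, SymIntEqByExprSketch) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())
```
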
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139553
Approved by: https://github.com/ezyang
2024-11-02 23:57:11 +00:00
b46e1fc141 [Dynamo] Fix graph break when tensor.split() is called within a device context manager (#139270)
Fixes: #139183

Note: this case can also be reproduced on cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139270
Approved by: https://github.com/ezyang

Co-authored-by: Vincent Moens <vincentmoens@gmail.com>
2024-11-02 23:55:51 +00:00
e31136d07b Expose Storage _use_count API in Python (#139426)
It would be nice to replace the `torch._C._storage_Use_Count` call in https://github.com/pytorch/torchtune/pull/1936, or at least avoid needing to know about `_cdata` in OSS code.

Initially keeping it private as Tensor._use_count is also private.

In favor over https://github.com/pytorch/pytorch/pull/139109 in solving the same problem, as exposing an existing API is better than adding a new one (and this enables a more robust fix)
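
A quick usage sketch, assuming the exposed accessor is `UntypedStorage._use_count()` as the title suggests:

```python
import torch

t = torch.randn(4)
storage = t.untyped_storage()
view = t[:2]  # a view sharing the same storage

# The exact count depends on how many live objects reference the StorageImpl
# (the tensor, the view, and the Python storage wrapper itself).
assert storage._use_count() > 1
```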

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139426
Approved by: https://github.com/soulitzer
2024-11-02 23:36:31 +00:00
f6e5d09682 Raise error for int64 and bool dtypes in nanmean, even for empty tensors (#138745)
This PR ensures that the `nanmean()` function raises a `RuntimeError` when using `int64` or `bool` dtypes, even for empty tensors. Previously, non-empty tensors correctly raised errors for unsupported dtypes, while empty tensors did not. This change brings consistent error handling for both cases.

addressing the need raised in an issue by @hyperkai  (Issue [#131043](https://github.com/pytorch/pytorch/issues/131043)).

### Changes

- Added checks in `nanmean_out()` to raise errors for `int64` and `bool` dtypes regardless of tensor size.
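
A hedged example of the expected behavior after this change (illustrative, not a verbatim test case):

```python
import torch

for x in (torch.empty(0, dtype=torch.int64), torch.tensor([1, 2, 3])):
    try:
        torch.nanmean(x)
    except RuntimeError as e:
        print(f"shape {tuple(x.shape)}: {e}")  # both empty and non-empty int64 now raise
```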

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138745
Approved by: https://github.com/ezyang
2024-11-02 22:52:40 +00:00
232af152b5 Fix graph breaks related to specialized float inputs (#139482)
Fixes issue with timm models where

example_value = 0.09999
proxy.node.target = <built-in function sub>

would fall through to

```
        unimplemented(
            "torch.* op returned non-Tensor "
            + f"{typestr(example_value)} {proxy.node.op} {proxy.node.target}",
            case_name="unsupported_operator",
        )
```

and graph break

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139482
Approved by: https://github.com/ezyang
ghstack dependencies: #139451
2024-11-02 21:58:46 +00:00
854be65fa0 Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013)"
This reverts commit 55038aa66162372acc1041751d5cc5c8ed9bc304.

Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2453085932))
2024-11-02 18:29:10 +00:00
e9eb7b1b13 [CI] Skip test_cuda_tracker_equivalence for ROCm (#139543)
Test fails on ROCm, skipping it for this platform.
Resolves https://github.com/pytorch/pytorch/issues/139515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139543
Approved by: https://github.com/huydhn
2024-11-02 15:39:07 +00:00
92d7f29e59 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit f6be44c74e012fb4329e6e716ebb78e9f5092a3b.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to more fbcode errors ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452985581))
2024-11-02 13:11:04 +00:00
709752e0bb Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)"
This reverts commit 293fbb42d207058d49f0ae40ca408214ee88b76b.

Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))
2024-11-02 13:04:00 +00:00
f6be44c74e Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-02 11:50:11 +00:00
55038aa661 [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slipped through previously (and still does today) because there is no CI test with the following recipe: eager init + new group + P2P in that new group.
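
A hedged sketch of that recipe (to be run with `torchrun --nproc-per-node=2`; the exact flow is illustrative):

```python
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
torch.cuda.set_device(rank)
# Passing device_id triggers eager NCCL comm init for the default PG.
dist.init_process_group("nccl", device_id=torch.device("cuda", rank))

group = dist.new_group([0, 1])          # new group after eager init
t = torch.ones(4, device="cuda")
if rank == 0:
    dist.send(t, dst=1, group=group)    # P2P in that new group
elif rank == 1:
    dist.recv(t, src=0, group=group)

dist.destroy_process_group()
```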

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-11-02 07:47:55 +00:00
2a3fe06ce0 Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598)"
This reverts commit 39ec5a20ea3d7bc8c2147f8363f8a06f4bb1e953.

Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails an executorch test https://github.com/pytorch/executorch/blob/main/exir/backend/test/test_graph_partition.py#L114-L175 ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2452903705))
2024-11-02 07:19:22 +00:00
f3238106fd Revert "Allow inplacing buffer when other users are inconsequential (#138383)"
This reverts commit 030f70b40bca62993bd65d03c58ded45601abe35.

Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))
2024-11-02 06:53:48 +00:00
0863d6a08e Revert "[inductor] Remove SIMDKernel.last_usage (#139364)"
This reverts commit 286d3ce266ce01ca905afb1cc9ea5d81abf79ff7.

Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:11 +00:00
9331640e26 Revert "[inductor] Remove Node.last_usage mutation (#139365)"
This reverts commit 1e934b473cabe6bc003f66d9811082e97c958a31.

Reverted https://github.com/pytorch/pytorch/pull/139365 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
dc4b459737 Revert "[inductor] Move remove_kernel_local_buffers to Kernel (#139370)"
This reverts commit b57b4b7f9b168389def15ea06a4dcf9e5f6f4f04.

Reverted https://github.com/pytorch/pytorch/pull/139370 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
66a401c9e1 Revert "[inductor] Simplify remove_kernel_local_buffers (#139452)"
This reverts commit 73c0762a34ef152450287dbc365cb8db930031b7.

Reverted https://github.com/pytorch/pytorch/pull/139452 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
98e11b0021 Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)"
This reverts commit c53beab3775671b5b7ec6106737c0d8939b8455a.

Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
fdd298dcb7 add hex method on SymFloat (#139451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139451
Approved by: https://github.com/ezyang
2024-11-02 05:33:19 +00:00
8d1eaa3da6 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit a6630bcf8736e4d66375688dfd8b45c401de3fef.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))
2024-11-02 03:38:15 +00:00
540f3ef9b1 Fix flex_decode to build offsets off of strides (#139516)
Fixes PR: https://github.com/pytorch/pytorch/issues/139462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139516
Approved by: https://github.com/Chillee
2024-11-02 03:17:46 +00:00
293fbb42d2 [AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154
Approved by: https://github.com/angelayi
ghstack dependencies: #139153
2024-11-02 03:10:05 +00:00
a46a79fe92 [AOTI] Ignore .o files in package_aoti (#139153)
Summary: There is no point to package .o files since a .so file is included in that package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139153
Approved by: https://github.com/angelayi
2024-11-02 03:10:05 +00:00
c53beab377 [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-02 03:04:22 +00:00
387b120549 [ONNX] Remove type promotion rule for pow (#139527)
ONNX supports different input types in Pow, so type promotion is not needed.

The resulting graph is the following:

```py
ONNXProgram(
    model=
        <
            ir_version=9,
            opset_imports={'': 18, 'pkg.onnxscript.torch_lib.common': 1},
            producer_name='pytorch',
            producer_version='2.6.0a0+git59a1af5',
            domain=None,
            model_version=None,
        >
        graph(
            name=main_graph,
            inputs=(
                %"x"<FLOAT16,[3]>
            ),
            outputs=(
                %"pow_1"<FLOAT16,[3]>
            ),
        ) {
            0 |  # node_Constant_0
                 %"val_0"<?,?> ⬅️ ::Constant() {value=Tensor<FLOAT,[]>(array(2., dtype=float32), name=None)}
            1 |  # node_Pow_1
                 %"pow_1"<FLOAT16,[3]> ⬅️ ::Pow(%"x", %"val_0")
            return %"pow_1"<FLOAT16,[3]>
        }
...
    ,
    exported_program=
        ExportedProgram:
            class GraphModule(torch.nn.Module):
                def forward(self, x: "f16[3]"):
                     # File: /workspace/pytorch/test/onnx/exporter/test_small_models_e2e.py:53 in forward, code: return x**2.0
                    pow_1: "f16[3]" = torch.ops.aten.pow.Tensor_Scalar(x, 2.0);  x = None
                    return (pow_1,)

        Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='pow_1'), target=None)])
        Range constraints: {}

)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139527
Approved by: https://github.com/titaiwangms
2024-11-02 02:19:50 +00:00
7e65060410 Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>
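
Both benchmark tables correspond to plain `torch.sort` calls; an illustrative example of the contiguous and discontiguous cases:

```python
import torch

x = torch.randn(1_000_000, dtype=torch.float32)   # contiguous in the sorted dim
vals, idx = torch.sort(x)

y = torch.randn(1_000_000, 2)[:, 0]                # stride 2: discontiguous in the sorted dim
vals2, idx2 = torch.sort(y)
```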

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
2024-11-02 02:14:01 +00:00
edd3f5a94d [profiler] fix a building warning by adding USE_KINETO namespace for setTraceID (#139461)
Fix: https://github.com/pytorch/pytorch/issues/139460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139461
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/sraikund16
2024-11-02 01:02:29 +00:00
092fe2f422 Handle nan case when checking mutations (#139483)
Test Plan: PT2 readiness models

Differential Revision: D65340986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139483
Approved by: https://github.com/zou3519
2024-11-02 00:49:05 +00:00
b71e813bce [dynamo, 3.13] fix bytecode nop tests (#139323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323
Approved by: https://github.com/jansel
2024-11-02 00:39:36 +00:00
8c17830dea [AOTI] Unify how weights are stored as data section (#139471)
Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macos. Unify the code by adopting that approach for Linux as well.

Test Plan: CI

Differential Revision: D65302273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471
Approved by: https://github.com/chenyang78
2024-11-02 00:23:24 +00:00
aa54b2467f [executorch hash update] update the pinned executorch hash (#139133)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139133
Approved by: https://github.com/pytorchbot
2024-11-02 00:14:47 +00:00
ee2f8a50d3 Class rename (#139490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490
Approved by: https://github.com/exclamaforte, https://github.com/zou3519
ghstack dependencies: #139295
2024-11-02 00:10:17 +00:00
c95adb9c5b Revert "use more elements per thread for narrow dtypes (#139449)"
This reverts commit f5b9e725d14a9a2906b7f1701d97cb4e95891a92.

Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a bunch of tests are failing after it lands, it looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2452723863))
2024-11-01 23:42:16 +00:00
b617d4813c Revert "fix dynamo tracking numpy 2 ops (#138686)"
This reverts commit 124eac255e3af04379721af09631a45a05c7fb05.

Reverted https://github.com/pytorch/pytorch/pull/138686 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I am seeing inductor failure with hf_BigBird number of graph breaks after it lands ([comment](https://github.com/pytorch/pytorch/pull/138686#issuecomment-2452718164))
2024-11-01 23:34:06 +00:00
77b72d686e [BE][MPS] Make metal shaders compile cleanly (#139522)
I.e. without warnings, by deleting dead code and fixing one
signed-unsigned comparison warning

Also, pass `-Werror` to metal compiler if WERROR options is set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139522
Approved by: https://github.com/Skylion007
2024-11-01 23:22:47 +00:00
2382b3b6d8 [Easy] Add joint graph passes, fallback_random to bisector (#139295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139295
Approved by: https://github.com/zou3519, https://github.com/exclamaforte
2024-11-01 23:21:53 +00:00
1e73842029 Refactor FxGraphDrawer to use HTML-like labels (#137726)
Fixes https://github.com/pytorch/pytorch/issues/137499
Testing: Added a new unit test to make sure that the regression case succeeds.
I'm debating about whether to make the borders visible. I'm partial to no borders, but it might make it harder for some people to read?
![68a2b0e3-orig_fx_graph_diagram](https://github.com/user-attachments/assets/fbc2fd98-9e76-488e-8ebe-c64fbf206932)
Vs.
![2bfe1c4f-orig_fx_graph_diagram](https://github.com/user-attachments/assets/b6bc88ba-dda2-4cf7-84ac-a615e1e03a74)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137726
Approved by: https://github.com/eellison, https://github.com/malfet
2024-11-01 23:19:50 +00:00
60542eeb33 [inductor] set sanitize_overflow=False for triton kernels (#139502)
In upstream triton, https://github.com/triton-lang/triton/pull/4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. https://github.com/triton-lang/triton/pull/5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139502
Approved by: https://github.com/bertmaher
2024-11-01 23:10:21 +00:00
da395384a2 Delete Windows GPU jobs in periodic (#139336)
As an outcome of https://fburl.com/gdoc/voce5o06, we could stop running Windows GPU tests on periodic pending the green light from MS. No one is monitoring these jobs atm.

We already have Windows CUDA and CPU build jobs in trunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139336
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:26:22 +00:00
4c64a7f33f [pgnccl] add a restart test for PGs in blocking mode (#139496)
Summary:
Restarting (aborting and re-initializing a PG) is a basic need if we want
to achieve in-process restart of PGs without tearing down the whole
process.

Add this test to verify that this is supported by current NCCL.
Note that this restart test passes reliably only in blocking mode for now.
In nonblocking mode, there is a problem in either NCCL init or abort that
needs further investigation.
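
A hedged sketch of the restart recipe at the public API level (the test itself likely drives lower-level abort paths; run under `torchrun`):

```python
import os
import torch
import torch.distributed as dist

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

dist.init_process_group("nccl")
t = torch.ones(1, device="cuda")
dist.all_reduce(t)

dist.destroy_process_group()        # tear down the PG in-process
dist.init_process_group("nccl")     # re-initialize without restarting the process
dist.all_reduce(t)
dist.destroy_process_group()
```
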
Test Plan:
new UT

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2024-11-01 22:13:37 +00:00
0b13bdd877 Delete parallelnative jobs in periodic (#139328)
As an outcome of https://fburl.com/gdoc/voce5o06, we can now clean up parallelnative build and test jobs in periodic.  There is not much value in running them anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139328
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:05:13 +00:00
8eb75cbad6 Delete iOS jobs from periodic (#139345)
As an outcome of https://fburl.com/gdoc/voce5o06 and confirm with @iseeyuan, we can now clean up iOS lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/ios-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139345
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:04:27 +00:00
8ad76efb8d Delete Vulkan jobs from periodic (#139354)
As an outcome of https://fburl.com/gdoc/voce5o06, we can clean up this job now as the backend has been marked as deprecated https://pytorch.org/tutorials/prototype/vulkan_workflow.html to be replace by ExecuTorch Vulkan delegate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139354
Approved by: https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:03:12 +00:00
a979318ef7 Add section to serialization note re weights_only (#139433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221
2024-11-01 21:51:50 +00:00
a1f854f270 [MPS] Compile kernels into Metallib (#138636)
PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel in the MPS backend was added in https://github.com/pytorch/pytorch/pull/82307 )
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see  https://github.com/pytorch/pytorch/pull/125550 )

But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a `.metallib`

This PR does exactly that by:
 - Moving shader sources into separate .metal files
 - Adds check on whether full Xcode installed or just DeveloperTools
 - If full Xcode is installed, compiles and links shaders into `.metallib` for the Metal-3.0 (available on macOS 13) and Metal-3.1 (available on macOS 14, can use bfloat) standards, and bundles both using the `-sectcreate` linker option and the `getsectiondata` API call. The `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. Logic for generating the metal libraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only DeveloperTools CLI is installed, automatically wraps .metal into `_metallib.h` that contains shader source wrapped in `MetalShaderLibrary`

The bulk of the changes introduced in this PR is just moving code around, i.e. for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder and the embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and  index.out |
| 2.0  | 4  | cross, repeat and view  |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - Build process is a bit more complicated than it used to be
  - Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
2024-11-01 21:47:20 +00:00
a6630bcf87 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-01 21:43:25 +00:00
9c2ffce71a add condition for freeable input buffer (#139480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139480
Approved by: https://github.com/yf225
ghstack dependencies: #139396
2024-11-01 21:15:40 +00:00
18f3b3c991 Clean up Android jobs in CI (#139350)
As an outcome of https://fburl.com/gdoc/voce5o06 and confirm with @iseeyuan, we can now clean up Android lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/android-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139350
Approved by: https://github.com/ZainRizvi
2024-11-01 21:10:19 +00:00
c412a42ae2 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics. That will be much easier to do in OSS, and having the relevant timing code in the OSS area of the codebase helps with that. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-01 21:06:59 +00:00
0e57f2b589 [invoke_subgraph] Change the joint_graph output signature to simplify min-cut partitioner (#139326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139326
Approved by: https://github.com/zou3519
ghstack dependencies: #139216, #139130
2024-11-01 21:02:32 +00:00
6a268c3fbb [invoke_subgraph] Generate fake_inputs correctly (#139130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139130
Approved by: https://github.com/zou3519
ghstack dependencies: #139216
2024-11-01 21:02:32 +00:00
4c756cacfd [invoke_subgraph] Re-enable fake tensor model in the fake tensor impl (#139216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139216
Approved by: https://github.com/zou3519
2024-11-01 21:02:32 +00:00
5d67efb809 [ONNX] New registration API (#135403)
The ONNX custom ops registration API.

## Design

1. Create a "custom_translation_table: dict[Callable, Sequence[Callable] | Callable" parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
		- When there is a single callable, it is the translation function for the op
		- When there is a Sequence of callable, the exporter's dispatcher will dispatch to these callables in order based on input dtypes.
		- The translation functions can be a plain python function that calls onnxscript ops (traced), or an onnxscript function.
		- Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.

```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```
Support for functions that handle complex inputs will be in separate PRs.

fixes https://github.com/pytorch/pytorch/issues/138391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
2024-11-01 20:58:54 +00:00
f5b9e725d1 use more elements per thread for narrow dtypes (#139449)
Fix a perf issue for narrow types by accessing more elements per thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-01 20:41:13 +00:00
73c0762a34 [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-01 20:36:39 +00:00
dcdcb8b364 Avoid overflow in float32-to-int32 test (#139489)
Summary:

Triton has added some integer overflow detection when kernels are compiled with
`debug=True`, and this test results in integer overflow (2.0 is 0x40000000,
times 2 is 0x80000000 which overflows a signed int32).

Assertion `int32 overflow detected for operation mul` failed
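
A quick check of the arithmetic above (illustrative; not part of the test itself):

```python
import struct

bits = struct.unpack("<i", struct.pack("<f", 2.0))[0]
print(hex(bits))             # 0x40000000 -- float32 bit pattern of 2.0
print(hex(bits * 2))         # 0x80000000 -- exceeds INT32_MAX (0x7fffffff)
print(bits * 2 > 2**31 - 1)  # True, hence the device-side overflow assert
```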

Fixes #139479

Test Plan:
```
python inductor/test_torchinductor.py -k test_float32_to_int32_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139489
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/chenyang78
2024-11-01 20:22:19 +00:00
0dbc284a72 [SymmetricMemory] expose signal_pads as tensors in Python (#138754)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR

```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)

# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)

# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-11-01 20:17:15 +00:00
124eac255e fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
As we upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in Dynamo tracing.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` that considered supported for NumPy 2.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
2024-11-01 19:51:40 +00:00
ea0e09b3f3 Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221)
Fixes https://github.com/pytorch/pytorch/issues/129698

https://github.com/pytorch/pytorch/pull/139106 without pickletools

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221
Approved by: https://github.com/malfet
ghstack dependencies: #138936
2024-11-01 19:31:39 +00:00
f3b485eb2a [reland] Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#137064)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been
more than two weeks since this landed. You can flip the config locally
as a workaround.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064
Approved by: https://github.com/albanD, https://github.com/eellison
2024-11-01 19:21:16 +00:00
abc5d59dcb config: create Config objects with JK support (#138766)
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally we've added a JK option to this
which resolves the JK.

This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to see the result.

Due to preceding work, basically everything works correctly here; we only
had to update a couple of tests and modify the getattr behaviour.

Note that we are updating the justknobs_check function to support a
default option, to make the default option work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
2024-11-01 19:20:37 +00:00
eqy
6fc63b4ef1 [ROCM][CUDA][NCCL] Disable test_lowering_one_shot_all_reduce on ROCM (#139414)
I'm not sure this is expected to run if it requires buffer-registration support CC @yifuwang @huydhn @syed-ahmed #138029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139414
Approved by: https://github.com/huydhn, https://github.com/yifuwang
2024-11-01 18:39:47 +00:00
391ee62180 Ensure scalar tensor device matches attn_mask for convert_boolean_attn_mask_cudnn. (#139450)
This is causing a small performance hit when using SDPA with the cuDNN backend due to unnecessary host-to-device memcpy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139450
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-11-01 18:38:02 +00:00
d8b606ecb5 [fx graph cache] Support freezing with FX graph caching (#136505)
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.
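
A minimal usage sketch of the combination described above (the config attribute names are assumed to be the standard `freezing` and `fx_graph_cache` flags):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.freezing = True        # inline parameters as constants
inductor_config.fx_graph_cache = True  # cache compiled FX graphs on disk

model = torch.nn.Linear(8, 8).eval()
compiled = torch.compile(model)
with torch.no_grad():
    compiled(torch.randn(2, 8))        # a later run/process may hit the cache
```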

Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
2024-11-01 18:29:29 +00:00
7d644f025f make equation behind torch.isclose element-wise (#138459)
The current formula behind torch.isclose, according to the docs, is
![imagen](https://github.com/user-attachments/assets/6b79f6d8-e675-4585-b26b-0c6933f7ecdd)

However, torch.isclose acts element-wise, so this formula may be misleading at first, given that the docs say that `input` and `other` are, respectively, the first and second tensors to compare. I propose the following change to stress the element-wise nature of the comparison in the equation:
![imagen](https://github.com/user-attachments/assets/2926a1c6-c4fa-4c48-8874-106521d3f54c)
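
For reference, the proposed element-wise form reads:

```latex
\lvert \mathrm{input}_i - \mathrm{other}_i \rvert \leq \mathrm{atol} + \mathrm{rtol} \times \lvert \mathrm{other}_i \rvert
```
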
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138459
Approved by: https://github.com/soulitzer
2024-11-01 18:18:33 +00:00
1857be1b48 Fix S390 builds (#139491)
Caused by https://github.com/pytorch/pytorch/pull/137918. Fixed by guarding all cpuinfo use with `!defined(__s390x__) && !defined(__powerpc__)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139491
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-01 18:16:29 +00:00
51adab0829 [MPS] Fix reduction ops outputs for empty tensors (#139446)
By adding a switch over all reduction types that either sets the output to a given value or raises a runtime error.
Before this change, reduction ops returned uninitialized values in many cases.
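
Illustrative expectations on an MPS device after this change (mirroring CPU/CUDA semantics; exact messages may differ):

```python
import torch

x = torch.empty(0, 3, device="mps")
print(x.sum(dim=0))    # reduction with an identity -> zeros of shape (3,)
try:
    x.amax(dim=0)      # max over an empty dimension has no identity
except RuntimeError as e:
    print(e)           # now raises instead of returning uninitialized data
```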

Fixes https://github.com/pytorch/pytorch/issues/139400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139446
Approved by: https://github.com/Skylion007
2024-11-01 17:32:12 +00:00
7d081cabfb [AOTI] Forward fix #139458 (#139485)
Summary: A new test added in https://github.com/pytorch/pytorch/pull/139458 only fails on certain CI instances. Skip it for now as the failing test has a low priority.

@diff-train-skip-merge (to silent fb bot so that I can land this myself)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139485
Approved by: https://github.com/huydhn, https://github.com/hl475
2024-11-01 17:14:40 +00:00
3e0f4d18eb [PyTorch] Support non-zero beta in fp16_gemv_trans (#138275)
No real reason to have the zero-beta restriction, so let's lift it.

Testing: intentionally broke new paths locally to verify test coverage existed

Differential Revision: [D64407752](https://our.internmc.facebook.com/intern/diff/D64407752/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138275
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083, #137918, #138005
2024-11-01 16:49:05 +00:00
195b1b9a9b [PyTorch] Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures (#138005)
Following up on previous rev to use fp16_gemv_trans in gemv, not just gemm-used-for-gemv.

Differential Revision: [D64351092](https://our.internmc.facebook.com/intern/diff/D64351092/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138005
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083, #137918
2024-11-01 16:49:05 +00:00
fad5d89321 [PyTorch] Hook up fp16_gemv_trans to x86 fp16 GEMM (#137918)
This is the first big milestone we've been building towards!
(Following rev also hooks this up to actual gemv.)
Testing: To check perf, I ran `python torchchat.py generate stories110M
--dtype fp16 --device cpu` on an x86 machine without AVX512FP16. Observed a roughly 5x tokens/sec increase.
Differential Revision: [D64280688](https://our.internmc.facebook.com/intern/diff/D64280688/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64280688/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137918
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083
2024-11-01 16:48:56 +00:00
d79c5143d8 [PyTorch] Add efficient isnan for NEON half (#139083)
Same as the efficient one for float when f16 hardware support is available.

Testing: Added exhaustive isnan test coverage

Differential Revision: [D65003321](https://our.internmc.facebook.com/intern/diff/D65003321/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139083
Approved by: https://github.com/malfet
ghstack dependencies: #139082
2024-11-01 16:40:51 +00:00
9ecd7d1587 [PyTorch] Add efficient isnan for NEON float (#139082)
Just test x != x rather than applying element-by-element scalar isnan.

Testing: vec_test_all_types checks IsNan

Differential Revision: [D65001633](https://our.internmc.facebook.com/intern/diff/D65001633/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139082
Approved by: https://github.com/malfet
2024-11-01 16:40:51 +00:00
3cbf0c0bbf [Inductor][CPP] Cache weight tiles in L1D for AMX int8 WoQ GEMM (#136688)
# Summary

The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights.

This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.

The figure below is from the document linked above -

<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">

## Performance data

Collected on 48 physical cores of one socket of Intel Xeon  Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.

|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |
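
A hedged sketch of the op being benchmarked above, using the first row's shapes (the `(activation, int8 weight, per-channel scales)` argument order is an assumption):

```python
import torch

M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, dtype=torch.bfloat16)
w = torch.randint(-8, 8, (N, K), dtype=torch.int8)
scales = torch.rand(N, dtype=torch.bfloat16)
out = torch.ops.aten._weight_int8pack_mm(a, w, scales)  # expected shape: [M, N]
```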

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5
2024-11-01 16:32:22 +00:00
b57b4b7f9b [inductor] Move remove_kernel_local_buffers to Kernel (#139370)
This method mutates the kernel, so it fits better in that class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365
2024-11-01 16:28:15 +00:00
1e934b473c [inductor] Remove Node.last_usage mutation (#139365)
I can't figure out why this is needed.  Let's see if tests fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365
Approved by: https://github.com/shunting314
ghstack dependencies: #139364
2024-11-01 16:28:15 +00:00
286d3ce266 [inductor] Remove SIMDKernel.last_usage (#139364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-01 16:28:15 +00:00
df0c1eceb9 [pgnccl][simple] clean up unused members of PGNCCL (#139436)
Summary:
Found these unused members when prototyping something else.
Better to remove unused members.
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139436
Approved by: https://github.com/Skylion007
2024-11-01 16:25:04 +00:00
33dce10ece [AOTI][reland] Update zero size computation in clone_preserve_strides (#139458)
Summary: Reland https://github.com/pytorch/pytorch/pull/139224. clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensors correctly.

Differential Revision: [D65317451](https://our.internmc.facebook.com/intern/diff/D65317451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139458
Approved by: https://github.com/hl475
2024-11-01 13:51:02 +00:00
560a0704c5 Use a different test name for testConversionToStringView (#139448)
Summary:
The change comes from D65214804 (https://github.com/pytorch/pytorch/pull/139239)

`buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` doesn't like having 2 `testConversionToString` in the same suite `StringViewTest`, so we just need to use a different name there.

Test Plan: `buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` passes

Differential Revision: D65314266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139448
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-11-01 13:25:16 +00:00
e6e140c3d7 [Inductor] fix a compilation time regression caused by user-visible output handling (#139420)
This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732.

The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference.

```
Before the offending PR

compilation_latency mean=176.022 seconds
compilation_latency mean=176.564 seconds

On the offending PR

compilation_latency mean=180.096 seconds
compilation_latency mean=179.101 seconds

On the fix

compilation_latency mean=173.153 seconds
compilation_latency mean=174.182 seconds
```

(I think the fix being faster than the baseline is due to noise)

The cause of the regression is an inefficiency in `is_user_visible_output()`. Specifically, it used `output_node.args[0].index(node)` to obtain the output idx for each node (and we called this for each node twice). The offending PR had the assumption that `len(output_node.args[0])` is rather small. However, it has been proven false by the benchmark (it was 1900+ for timm_models/hrnet_w18).

The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`.
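A minimal sketch of the idea described above (hypothetical names, not the actual Inductor code): replace repeated `list.index()` lookups with one precomputed mapping.

```python
# Before: each query scans the output list, so it costs O(len(output_args)) per node.
def output_idx_slow(node, output_args):
    return output_args.index(node)

# After: build the mapping once, then each lookup is O(1).
def build_output_idx_map(output_args):
    return {n: i for i, n in enumerate(output_args)}
```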

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420
Approved by: https://github.com/ezyang
2024-11-01 08:27:40 +00:00
307ee7926e [Workflow][1/3] Remove benchmark tests from rerun disabled tests (#139337)
Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these are considered non-unittest.
See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Manual Test
- Test run Inductor.yml:
https://github.com/pytorch/pytorch/actions/runs/11603287758/job/32309968542?pr=139337
- Test run inductor-unittest.yml ([3cbd83d](3cbd83d3d5))
https://github.com/pytorch/pytorch/actions/runs/11605399925/job/32315737205?pr=139337

# Steps to fix the issue

- [x]  [**THIS PR**] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139337
Approved by: https://github.com/huydhn
2024-11-01 08:23:51 +00:00
f7407b3de0 [Workflow][2/3] Remove benchmark tests from rerun disabled test (#139407)
Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these are considered non-unittest.
See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Steps to fix the issue
- [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [x] [**THIS PR**] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139407
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-01 08:09:31 +00:00
5e4c8b671c [inductor] loaf-fix (#139376)
Fix https://github.com/pytorch/pytorch/issues/128063 .

Now for this snippet
```
        def f(x):
            y = torch.sum(torch.sum(x, dim=-1))

            z = x / 10.0
            z_t = z.t().contiguous().t()
            return y, z, z_t
```
Inductor could generate a single kernel for the first reduction and the two pointwise kernels (if loop-ordering after fusion is enabled), and the generated kernel reads `x` only ONCE. (Without proper handling, the two pointwise kernels may each access x once even if they are fused.)

The PR fixes 2 subtle bugs regarding LOAF:
1. When we reorder loops for a FusedSchedulerNode, we check if each sub-node's sizes match. But some nodes have sizes of `list` type (if their loops are not reordered) while others have sizes of `tuple` type (if their loops are reordered). I could change the upstream code to uniformly use either `list` or `tuple`, but without strong enforcement, future code could break this. So I just convert sizes to a uniform type before comparison.
2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel.
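A minimal sketch of the normalization mentioned in point 1 (an assumed helper, not the actual Inductor code): compare sizes as tuples so list-vs-tuple differences caused by loop reordering don't produce spurious mismatches.

```python
def sizes_equal(a, b):
    # Normalize both sides to tuples before comparing.
    return tuple(a) == tuple(b)
```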

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376
Approved by: https://github.com/jansel, https://github.com/eellison
2024-11-01 07:54:32 +00:00
39ec5a20ea [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
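An illustrative sketch of the complexity argument (hypothetical data shapes, not the actual partitioner code): the partition table already keys by partition id, so there is no need to scan the per-node assignment map.

```python
def partition_ids_via_assignment(assignment):
    # O(number of nodes): scans every node's entry.
    return set(assignment.values())

def partition_ids_via_table(partitions_by_id):
    # O(number of partitions): the table is already keyed by partition id.
    return set(partitions_by_id.keys())
```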

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/tarun292

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-01 07:42:36 +00:00
61df90e3f6 Add TORCHDYNAMO_EXTENDED_ADVICE (#137159) (#137196)
Fixes #137159

Happy to contribute to this project for the first time. If I missed any contribution guidelines, please let me know, I'm happy to adjust.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137196
Approved by: https://github.com/ezyang
2024-11-01 06:43:26 +00:00
86db2cd194 [export] Initial draft export (#139383)
Differential Revision: [D65288590](https://our.internmc.facebook.com/intern/diff/D65288590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139383
Approved by: https://github.com/zou3519
2024-11-01 06:25:44 +00:00
300ca6368f Remove deprecated alias macro (2/3) (#137559)
**Detailed Descriptions:**
- Remove AT_ASSERTM Macro
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137559
Approved by: https://github.com/ezyang
2024-11-01 06:17:57 +00:00
0c47657b05 [dynamo] ignore False/None callback in fail_on_recompile/force_backend stances (#139215)
Fix https://github.com/pytorch/pytorch/issues/139202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139215
Approved by: https://github.com/jansel
2024-11-01 06:15:28 +00:00
cyy
4a2da52137 [1/N] Replace c10::sv with std::sv (#139453)
Picks some safe replacements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139453
Approved by: https://github.com/Skylion007
2024-11-01 05:39:37 +00:00
cyy
6ef6b3f586 Remove const fromDLPack overload (#139156)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139156
Approved by: https://github.com/ezyang
2024-11-01 04:12:46 +00:00
84416618a6 [Pipelining] Update schedules to use I, B actions. (#138886)
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.

Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.

Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR.  The logic for 'use_full_backward' is removed
from the runtime.

_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality, to validate the correctness of a schedule
IR.  'validate' operates on compute-only IR, while simulate operates on
compute + comm IR.  To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.

'simulate' was inefficiently written before this PR and needed to be
optimized to run quickly for extra large schedules with >32 ranks and
microbatches per rank used in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
2024-11-01 03:54:06 +00:00
094d288f40 Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:

1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization

This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:

1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
2024-11-01 03:18:02 +00:00
c8a648d4df Add option to dynamo_timed and chromium_event_logger for logging pt2 compile events (#139309)
This diff considerably changes the column format of PT2 Compile Events:

- Now, instead of logging one new column per every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention.

- Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.

Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309
Approved by: https://github.com/oulgen
2024-11-01 02:40:25 +00:00
46bca8a4b6 Export XPU oneDNN header to the public (#139177)
# Motivation
Export the oneDNN header to the public; for example, a third-party extension can now use `GpuStreamManager` to manage `dnnl::stream` for submitting oneDNN kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139177
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/malfet
2024-11-01 02:36:16 +00:00
04382efe5e [Bash][3/3] Remove benchmark tests from rerun disabled test (#139422)
Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these are considered non-unittest.
See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Steps to fix the issue
- [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [x] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139422
Approved by: https://github.com/huydhn
2024-11-01 01:49:58 +00:00
030f70b40b Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm; make sure there's an inplaced buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-11-01 01:24:40 +00:00
8ace3e8023 Add sv starts/ends_with (#139261)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139261
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-01 01:17:42 +00:00
2a309c0997 Fix weights_only for BUILD instructions for user allowlisted objects with __slots__ (#138936)
Previously the `BUILD` instruction missed handling for `__slots__`. **This only applies to things allowlisted via `add_safe_globals`/`safe_globals` that use slots.**

### Background
When does pickle serialize a `BUILD` instruction? When `state` is not `None` and `state_setter` is `None` [[link](c5b99f5c2c/Lib/pickle.py (L765))]. In this case, the docs tell us that either `__setstate__` or a `__dict__` update will be performed [[link](https://github.com/python/cpython/blob/3.13/Lib/pickletools.py#L1984)]

`__reduce__`/`__reduce_ex__` are expected to return tuples of length 2 to 6, where `state` is the 3rd argument. When the user doesn't patch `__reduce__` but patches `__setstate__`/`__getstate__`, `state` will be what is yielded by `__getstate__`.

Note the return type for [`__getstate__` ](https://docs.python.org/3/library/pickle.html#object.__getstate__)

- For a class that has no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is None.
- For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is `self.__dict__`.
- For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is a tuple consisting of two dictionaries: `self.__dict__`, and a dictionary mapping slot names to slot values. Only slots that have a value are included in the latter.
- For a class that has [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__) and no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__), the default state is a tuple whose first item is None and whose second item is a dictionary mapping slot names to slot values described in the previous bullet.

see handling in pickle code c5b99f5c2c/Lib/pickle.py (L1846-L1867)

Before this PR, we didn't account for the fact that when `__setstate__` is not defined, `state` might be a tuple, so this would fail:

```python
from dataclasses import dataclass

# Define the dataclass
@dataclass
class MyDataClass:
    __slots__ = ["x", "y"]
    x: int
    y: str
# Create an instance of the dataclass
my_data = MyDataClass(x=2, y=3)
# Save the dataclass to a file
torch.save(my_data, "my_data.pt")
with torch.serialization.safe_globals([MyDataClass]):
    loaded_my_data = torch.load("my_data.pt", weights_only=True)
# AttributeError: 'MyDataClass' object has no attribute '__dict__'
```
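A rough sketch of a BUILD-style state application that handles the tuple case described above (illustrative only, not the actual torch.serialization code):

```python
def apply_state(obj, state):
    # A two-element tuple means (dict state, slots state), per the pickle docs quoted above.
    if isinstance(state, tuple) and len(state) == 2:
        dict_state, slots_state = state
        if dict_state:
            obj.__dict__.update(dict_state)
        if slots_state:
            for name, value in slots_state.items():
                setattr(obj, name, value)
    elif state is not None:
        # Plain dict state for classes without __slots__.
        obj.__dict__.update(state)
```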

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138936
Approved by: https://github.com/malfet
2024-11-01 00:59:29 +00:00
c2ffd41a86 [inductor] Enable AMD cooperative reduction tests (#139230)
Fixes #139099

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139230
Approved by: https://github.com/eellison
2024-11-01 00:55:13 +00:00
f9ef880c0b [inductor] Refactor kernel args into SIMDKernelFeatures (#139327)
This is a refactor PR to move stuff around.  I'm planning to use the SIMDKernelFeatures class (in a future PR) to host new heuristics for selecting kernel types and block sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139327
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-01 00:30:14 +00:00
b6b9596607 Revert "[dynamo] Fix constant propagation in builtins and UserClasses (#131354)"
This reverts commit 44257c063e2f7bd9b35e6e4973f89d7f1cb65442.

Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break some internal tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2451050605))
2024-11-01 00:13:20 +00:00
d33849908d [aotd] Fuse tangents subclasses runtime traversals (#139068)
Reason:
Currently we have multiple traversals over tangents at runtime:
 - To check that types and structure are identical to what we guessed during tracing time
 - Coerce metadata
 - Coerce memory_format
 - Unwrap_tensor_subclass
All of them traverse the tree of subclasses via __tensor_flatten__ calls on the tangents.

Change:
To do everything in one traversal at runtime (including flattening)

Implementation details:

1. Add memory_format information inside SubclassCreationMeta; for PlainTensors, keep not only the (int) unwrapped_index but the memory_format too.

Preparing memory_format is optional (controlled by with_memory_format=True).

2. Remove the unused subclass_utils.create_metadata_for_subclass, which does not have any usages inside torch and would require an update of the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068
Approved by: https://github.com/bdhirsh
2024-11-01 00:03:02 +00:00
86602a66d7 [orm] fix live_memory computation in lpmf algorithm (#139396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139396
Approved by: https://github.com/yf225
2024-10-31 23:45:30 +00:00
3d3551506d Revert "[dynamo, 3.13] fix bytecode nop tests (#139323)"
This reverts commit c2d754441f8e941c208579661a04b5ed1e5e71bc.

Reverted https://github.com/pytorch/pytorch/pull/139323 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a regression in instruction count metric ([comment](https://github.com/pytorch/pytorch/pull/139323#issuecomment-2451017609))
2024-10-31 23:34:00 +00:00
6727f343b5 [c10d][fr][easy] Move check_no_missing_dump_files (#139417)
Summary:
Move check_no_missing_dump_files to after the "just print" location.
This allows us to print dump_files when there are actual missing files.

Test Plan:
```
torchfrtrace -j ~/pyper-training-online-924394600  --selected-ranks 1 2

Inferred common prefix nccl_trace_rank_
loaded 95 files in 0.040270328521728516s
built groups, memberships
Rank 1                                                              Rank 2
------------------------------------------------------------------  ------------------------------------------------------------------
broadcast(input_sizes=[[2]], state=completed)                       broadcast(input_sizes=[[2]], state=completed)
```
Without this change, the command was erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139417
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-10-31 22:55:01 +00:00
8e8040a5c2 [Pipelining] Optimize ready_to_schedule logic (#138924)
Used in both simulator and add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled.  For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.

The old implementation iteratively compared the candidate op to the
previous ops.  The new implementation uses set lookups to reduce
complexity.  It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.

I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s.  Most schedules take less than
10ms.
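
A conceptual sketch of the set-based check (hypothetical structure, not the actual Pipelining code): membership tests against a maintained set replace rescanning the full history of scheduled ops for every candidate.

```python
scheduled_ops = set()

def ready_to_schedule(op, unblocking_ops):
    # unblocking_ops: any op that makes `op` schedulable (e.g. a RECV_F before FORWARD,
    # or a prior FORWARD on the same rank); an empty set means unconditionally ready.
    return not unblocking_ops or any(u in scheduled_ops for u in unblocking_ops)

def mark_scheduled(op):
    # Maintain the set incrementally as ops are scheduled.
    scheduled_ops.add(op)
```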

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
2024-10-31 22:49:45 +00:00
c82e0d117a [Pipelining] Support separate dI / dW and V-schedules (#131762)
### Separate dI / dW:

PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.

Separating the B and W may add execution overhead or may be suboptimal
in cases where BW are 'fused', but it is worthwhile when separating B, W
lets the schedule be more efficient by filling in bubbles.  In some
cases, the schedule will still issue B followed by W at certain points,
so in these cases just merge them back into BW ops and execute them as
full backwards rather than executing a B followed by a W.

### V-schedules:

V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the scheduling logic, we also must allow scheduling the
stage 4 forward after running stage 3 forward, without expecting a stage
4 RECV_F.

In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process.  We
add new APIs to PipelineStage abstraction for passing the activations
both during forward and backward.  Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backward execution logic does not need to know the difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928
2024-10-31 22:49:45 +00:00
45da80b970 reland D65167805 "[export] Update min_val and max_val to Optional[int] in serialization." (#139394)
Summary:
had a land race with another diff, D65166035, to fix the schema.

According to the export team's discussion, we are upgrading min_val and max_val to optional fields, which shouldn't break BC and allows the schema to express infinity.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/rec/ir/tests:ir_export_deserialize_test

Differential Revision: D65273170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139394
Approved by: https://github.com/yiming0416
2024-10-31 22:28:32 +00:00
01136fb9e0 Update MPS_ERROR_RUNTIME_TOO_LOW message (#139427)
https://github.com/pytorch/pytorch/pull/133141 updated the min OS requirement to 13.0, but missed the message.

Fixes https://github.com/pytorch/pytorch/issues/139425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139427
Approved by: https://github.com/seemethere, https://github.com/kit1980
2024-10-31 22:04:08 +00:00
c1e7d85ce6 Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss (#132049)
#### Summary
This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, which is important for imbalanced data or when certain samples are more significant than others.

#### Changes
- **`weighted_huber_loss`**: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter.
- **`wmse_loss`** (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks.
- **`wmae_loss`** (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers.

#### Code Details
- **Input Validation**: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors.
- **Reduction Options**: Supports `none`, `mean`, and `sum` reductions to suit various computational needs.
- **Backward Compatibility**: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument.

#### Usage Example
```python
import torch
input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32)
target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32)
weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32)

loss = weighted_huber_loss(input, target, weights, delta=1.0)
print(loss)
```
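For illustration, a minimal sketch of what a weighted MSE along these lines could look like (assumed implementation, not the PR's exact code; the `mean` reduction here is a plain mean over the weighted errors rather than a weight-normalized mean):

```python
import torch

def wmse_loss(input, target, weights, reduction="mean"):
    # Per-element squared error scaled by the sample weights.
    loss = weights * (input - target) ** 2
    if reduction == "mean":
        return loss.mean()
    if reduction == "sum":
        return loss.sum()
    return loss  # reduction == "none"
```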
---

Feedback on these implementations is welcome; please let me know if further modifications are required.

Resolves #132465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-31 21:59:43 +00:00
82e74ad40e [aot autograd] refactor CompiledFunction.backward: control flow (3/N) (#139347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139347
Approved by: https://github.com/zou3519
ghstack dependencies: #139331, #139343
2024-10-31 21:53:03 +00:00
8134456a27 [aot autograd] refactor CompiledFunction.backward: epilogue (2/N) (#139343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139343
Approved by: https://github.com/zou3519
ghstack dependencies: #139331
2024-10-31 21:53:03 +00:00
04ce9ec087 [aot autograd] refactor CompiledFunction.backward: prologue (1/N) (#139331)
So for functional autograd + CA, most nodes are inlined in aot autograd. But user-defined callables aren't safe to make_fx unless dynamo traces through them. The AOT backward must be inlined by dynamo time. We plan to directly insert calls to the backward in the graph:
- call prologue
- call bwd graph
- call epilogue

Restructuring our AOT bwd implementation will make this implementation easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139331
Approved by: https://github.com/zou3519
2024-10-31 21:53:03 +00:00
8c22e09e39 [aoti] Add masked_select to cshim (#139071)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139071
Approved by: https://github.com/desertfire
2024-10-31 21:52:53 +00:00
b9acbde4fd Revert "Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)"
This reverts commit a49457279919b324d8ca1db85636d16d6dfd4e0f.

Reverted https://github.com/pytorch/pytorch/pull/138868 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the new tests are failing on fbcode ([comment](https://github.com/pytorch/pytorch/pull/138868#issuecomment-2450863895))
2024-10-31 21:46:06 +00:00
6a1c451479 Don't uselessly recompute axiom dict every static eval call (#138967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967
Approved by: https://github.com/ezyang
2024-10-31 21:16:55 +00:00
c4d9428b17 Revert "[AOTI] Update zero size computation in clone_preserve_strides (#139224)"
This reverts commit 206a8dde68faef052dfeedabb4180179ab24015e.

Reverted https://github.com/pytorch/pytorch/pull/139224 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139224#issuecomment-2450811914))
2024-10-31 21:05:07 +00:00
ddb291a881 Fix and test several NJT reductions (#139317)
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.

It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`

The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
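An illustrative sketch of the identity-element idea (hypothetical example, not the NJT implementation): padding the ragged dimension with the reduction's identity makes a dense reduction match a reduction over only the real values.

```python
import torch

values = torch.tensor([[1.0, 2.0, 0.0], [3.0, 4.0, 5.0]])     # padded dense view
mask = torch.tensor([[True, True, False], [True, True, True]]) # which entries are real

identity = 0.0  # identity element for sum
padded = torch.where(mask, values, torch.full_like(values, identity))
print(padded.sum(dim=1))  # tensor([ 3., 12.]) -- sums over the ragged dim
```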

Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar

Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
2024-10-31 20:55:38 +00:00
abb0dd4b00 Revert "[inductor] patterns to remove pointless view/permute pairs (#139136)"
This reverts commit 2b86cd74a60ca2483173ba3012506aeac85ab2d7.

Reverted https://github.com/pytorch/pytorch/pull/139136 on behalf of https://github.com/ZainRizvi due to Sorry but this PR seems to have broken on trunk. The failure: distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op [GH job link](https://github.com/pytorch/pytorch/actions/runs/11615060962/job/32346609889) [HUD commit link](2b86cd74a6) ([comment](https://github.com/pytorch/pytorch/pull/139136#issuecomment-2450796414))
2024-10-31 20:54:17 +00:00
76b5ee1119 [ONNX] Set flags correctly in tests (#139413)
Previously the flag was set via an envvar; since the envvar was read at initialization, it may not have been correctly set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139413
Approved by: https://github.com/titaiwangms
2024-10-31 20:46:23 +00:00
938803df94 Add bfloat16 support for per tensor/channel cpu/cuda fake quantize ops (#139306)
Summary: Fixes https://fb.workplace.com/groups/2240361332735959/permalink/8190736677698365

Test Plan:
buck2 test 'fbcode//mode/dev' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_tensor_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cuda (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

Differential Revision: D65221710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139306
Approved by: https://github.com/navsud
2024-10-31 20:41:15 +00:00
53c9c19e76 [Autotune Inductor] Some clean up and dataclassing (#139157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139157
Approved by: https://github.com/eellison
2024-10-31 20:04:55 +00:00
c2d754441f [dynamo, 3.13] fix bytecode nop tests (#139323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323
Approved by: https://github.com/jansel
2024-10-31 20:03:43 +00:00
1518cf426b Remove @skipIfTorchDynamo from test_extremal_numerics_l1_loss_cpu test (#139318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139318
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2024-10-31 19:57:28 +00:00
886579af99 Revert "Use static_assert to detect get_type_index used in device code (#139173)"
This reverts commit d391ed3f4ec6b1a78f7b34e27cba74b37d885475.

Reverted https://github.com/pytorch/pytorch/pull/139173 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139173#issuecomment-2450695123))
2024-10-31 19:50:19 +00:00
ac7acfb894 [Profiler] Create Auto-Trace Frontend for Trace ID (#139310)
Summary:
This PR adds an Auto-Trace implementation for Trace ID. By default, the python side will generate a uuid in the same format as the one set in the backend by kineto. Upon running an auto-trace, the python-generated trace id will overwrite the one set in kineto using the Config variable. Since we don't expect users to generate on-demand traces after an auto-trace, we can simply keep overwriting the backend trace id whenever autotrace is run. If we one day want to do something like this, we simply have to add a call in kineto on the backend to generate a new ID upon start of profiling.

We also implement a custom callback in the frontend so that users can generate their own trace ids if they wish to. This works similarly to the default, the only difference being that they have to manually set this callback after a profiler is generated. We use a specific call to set this rather than putting it in the frontend initializer, in case users want to change the trace_id for different repeats.

Test Plan: Tested both default and custom callbacks using the verbose prints added. Trace ids on the frontend and the prints on the backend for the manifold upload matched.

Differential Revision: D65178308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139310
Approved by: https://github.com/shengfukevin
2024-10-31 19:02:57 +00:00
7faf0ad913 [dynamo] fix deque.maxlen support when extending elements from left (#139279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139279
Approved by: https://github.com/jansel
2024-10-31 18:38:11 +00:00
8e27833e30 Ensure SWA boundary conditions w.r.t. definition (#133773)
According to the documentation, decay is a number in the [0,1] range, [i.e.](https://pytorch.org/docs/stable/optim.html):
```
Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to get_ema_multi_avg_fn, the default is 0.999.
```
An inspection of `swa_utils.py` indicates there are no checks for invalid values of `decay`. Adding asserts as suggested in this PR ensures a valid compute range (one way to enforce correct behavior; there may be more suitable ones). The papers `torch` cites for the reference idea/implementation also consider exclusively this range (e.g., https://arxiv.org/pdf/2310.04415).
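A sketch of the kind of range check suggested above (assumed form; the actual assert added to `swa_utils.py` may be phrased differently):

```python
def check_decay(decay):
    # Reject values outside the documented [0, 1] range.
    if not 0.0 <= decay <= 1.0:
        raise ValueError(f"decay must be between 0 and 1, got {decay}")
```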

Fixes https://github.com/pytorch/pytorch/issues/133772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133773
Approved by: https://github.com/janeyx99
2024-10-31 18:24:08 +00:00
547d921462 [Pipelining] Remove unused special case from simulator (#138928)
The special case was added during experimentation with batched send/recv
ops.  The ops needed to be jointly scheduled or the simulator would
think that each op was unschedulable since each contained a recv that
depended on the other's send.  The workaround I added was to let the
scheduler 'peek' one op ahead for unblocking, which let batched ops be
scheduled but also changed the behavior of non-batched ops.  It let RECV
ops be simulated one step earlier than the unblocking SEND ops, which
shortened the simulated duration of schedules.

Removing this workaround simplifies the simulator but more importantly
lends to optimizing the runtime of the simulator by making it much
easier to avoid copying or extending lists of previous ops on each
iteration.  It also restores the output of the simulator for non-batched
ops to a more natural output where RECV must happen at the same time or
later than matching SEND, rather than possibly a step earlier.

For example, for this test:
`python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0`

Before:

```
Step 0: 0F0      1RECV_F0
Step 1: 0SEND_F0
Step 2: 0F1      1RECV_F1
Step 3: 0SEND_F1 1F0
Step 4: 0RECV_B0 1B0
Step 5: 0B0      1SEND_B0
Step 6:          1F1
Step 7: 0RECV_B1 1B1
Step 8: 0B1      1SEND_B1
```

After:
```
Rank 0   Rank 1
Step 00: 0F0
Step 01: 0SEND_F0 1RECV_F0
Step 02: 0F1
Step 03: 0SEND_F1 1RECV_F1
Step 04:          1F0
Step 05:          1B0
Step 06: 0RECV_B0 1SEND_B0
Step 07: 0B0      1F1
Step 08:          1B1
Step 09: 0RECV_B1 1SEND_B1
Step 10: 0B1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928
Approved by: https://github.com/H-Huang
2024-10-31 17:48:35 +00:00
9d096e4d9f Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353, #139358
2024-10-31 17:32:19 +00:00
206a8dde68 [AOTI] Update zero size computation in clone_preserve_strides (#139224)
Summary: clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensor correctly. Fix that.

Differential Revision: [D65250405](https://our.internmc.facebook.com/intern/diff/D65250405)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139224
Approved by: https://github.com/angelayi
2024-10-31 17:07:18 +00:00
f93ebb2cf4 [Easy] Refactor post grad application of passes (#139293)
Refactors GraphTransformObserver to hook into the bisect manager pass application. And reworks post grad passes to use it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139293
Approved by: https://github.com/exclamaforte
ghstack dependencies: #139292
2024-10-31 17:05:27 +00:00
5075046db2 [c10d] separate comm init from getNCClComm (#139362)
Summary:
This PR is functionally a no-op, but it clearly separates the init logic from
getNCCLComm. getNCCLComm is now purely a 'read-only' function.
Test Plan:
existing CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139362
Approved by: https://github.com/wconstab
2024-10-31 16:58:20 +00:00
864beebb41 [easy] Add start event metadata to collected metadata for PT2 Compile Events (#139289)
We should be logging metadata from event starts to PT2 Compile Events too.

Differential Revision: [D65070086](https://our.internmc.facebook.com/intern/diff/D65070086/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139289
Approved by: https://github.com/oulgen
2024-10-31 16:52:30 +00:00
dd6263e2fb Implement HPUHooksInterface (#137338)
Fixes #137262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137338
Approved by: https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2024-10-31 16:26:19 +00:00
87f1990697 Revert "Don't uselessly recompute axiom dict every static eval call (#138967)"
This reverts commit 24b695ae2d5d85a3bda0e493fb4631d5e0add290.

Reverted https://github.com/pytorch/pytorch/pull/138967 on behalf of https://github.com/ZainRizvi due to Sorry, looks like this PR introduced a failure that was incorrectly classified as flaky, and the log classifier didn't identify the right log line either ([comment](https://github.com/pytorch/pytorch/pull/138967#issuecomment-2450228525))
2024-10-31 15:54:18 +00:00
2b86cd74a6 [inductor] patterns to remove pointless view/permute pairs (#139136)
These are not artificial patterns I came up with; they show up in the linear+CrossEntropyLoss graph.

Consider this snippet:
```
        class LinearAndCEL(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(C, V)
                self.ce = nn.CrossEntropyLoss()

            def forward(self, x, y):
                return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```

`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` will first view x as a [BxT, C] tensor, do the matmul to produce a [BxT, V] tensor, and then view this output back to a 3D tensor with shape [B, T, V]. User code then adds another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views. A pair of redundant permutes happens in the backward part when we compute gradients.

The view ops make it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's in general a nice thing to do.
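
A toy check of the redundancy described above (illustrative shapes only): the intermediate [B*T, V] view is never observed, so the view/view pair is a no-op.

```python
import torch

B, T, V = 2, 3, 4
x = torch.randn(B, T, V)
# Viewing to [B*T, V] and back to [B, T, V] leaves the tensor unchanged.
y = x.view(B * T, V).view(B, T, V)
assert torch.equal(x, y)
```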

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-10-31 15:35:46 +00:00
d21a25c6b7 [fx graph cache] Refactor FxGraphCachePickler, step 2 (#138683)
Summary: Move all the custom `_reduce_*` functions inside the FxGraphCachePickler class. This is mostly a cosmetic change since they're conceptually members of FxGraphCachePickler. But also in an upcoming diff, I'll add a member variable to the class to control how we handle constant tensors, so it will be convenient to be able to query that setting via `self`. I made the analogous changes to AOTAutogradCachePickler for consistency.

Test Plan: unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138683
Approved by: https://github.com/eellison
ghstack dependencies: #138681, #138682
2024-10-31 15:12:18 +00:00
92a2a9ded2 [BE] And delete DeprecatedTypeProperties cast (#139358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139358
Approved by: https://github.com/ezyang
ghstack dependencies: #139353
2024-10-31 14:39:22 +00:00
ea07718a5a Remove redundant warning compress (#139367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139367
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-31 14:39:19 +00:00
c934ed6567 init kineto after torch module initialized (#131448)
Fixes #131020

As discussed in the issue thread, we can use `KINETO_DAEMON_INIT_DELAY_S` to delay the initialization of `kineto` in case `kineto` is initialized before `libtorch_cuda.so`.

It's not clear how to set a proper value for the environment variable `KINETO_DAEMON_INIT_DELAY_S`, so here's a trick to make the initialization of `kineto` happen after the initialization of the `torch` module. I'm not sure whether this is an acceptable trick; please take a look at this PR, thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131448
Approved by: https://github.com/sraikund16, https://github.com/briancoutinho
2024-10-31 13:24:24 +00:00
ccaa2a206a [inductor] make requires_stride_order more unbacked-symint-aware (#137063)
Previously, we tried to sort SymInt strides to determine the stride
order. This PR makes the sorting more unbacked-symint-aware: given a Tensor
with sizes (u0, u1, u2), it has strides (u1 * u2, u2, 1), which is
sortable under the guard_size_oblivious assumptions.
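
A small concrete check of the stride claim above, with plain ints standing in for the unbacked symints u0, u1, u2 (assuming a contiguous tensor):

```python
import torch

u0, u1, u2 = 3, 4, 5
t = torch.empty(u0, u1, u2)
assert t.stride() == (u1 * u2, u2, 1)  # strictly descending, hence sortable
```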

Test Plan:
- test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137063
Approved by: https://github.com/eellison
2024-10-31 13:11:02 +00:00
3192bdeea4 [AOTI] Use len(serialized_weights) when calculating consts_size (#139054)
Fixes the failure of INT8 DLRM using AOTI.
The previous code calculates `consts_size` directly using `tensor` from `graph.constants`:
```
  consts_size = sum(
      get_nbytes_of_tensor(tensor, all_cuda)
      for (name, tensor) in graph.constants.items()
      if name not in graph.folded_constants
  )
```
Meanwhile, the actual bytes to serialize (`serialized_weights`) is using `graph.get_original_value_of_constant(name)`:
```
  serialized_weights = b"".join(
      _to_bytes(graph.get_original_value_of_constant(name), all_cuda)
      for name in graph.constants.keys()
      if name not in graph.folded_constants
  )
```

`tensor` from `graph.constants` could be different from `graph.get_original_value_of_constant(name)` thus making the `consts_size` inconsistent with the actual byte size of the `serialized_weights`, resulting in runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in https://github.com/pytorch/pytorch/pull/135205.

This PR directly gets `consts_size` using `len(serialized_weights)`, which fixes the inconsistency.
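
A sketch of the fix as described above (paraphrased from this description, not the verbatim diff), reusing the names from the snippets quoted earlier:

```python
def compute_consts_size(graph, all_cuda, _to_bytes):
    # Compute the size from the exact bytes that will be serialized.
    serialized_weights = b"".join(
        _to_bytes(graph.get_original_value_of_constant(name), all_cuda)
        for name in graph.constants.keys()
        if name not in graph.folded_constants
    )
    return len(serialized_weights), serialized_weights
```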

We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function, which is needed in the unit test to avoid accuracy issue on CI machines (earlier CPUs without VNNI).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-31 09:54:16 +00:00
24b695ae2d Don't uselessly recompute axiom dict every static eval call (#138967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967
Approved by: https://github.com/ezyang
2024-10-31 07:46:35 +00:00
73fde0d940 [PyTorch] Unbreak C10_ALWAYS_INLINE_ATTRIBUTE on MSVC (#139363)
At least one recent version refuses to accept it on a lambda, so disable it.

Differential Revision: [D65250256](https://our.internmc.facebook.com/intern/diff/D65250256/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D65250256/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139363
Approved by: https://github.com/ngimel, https://github.com/malfet
2024-10-31 07:40:05 +00:00
f98bc9a49d Revert D65167805 (#139371)
Summary:
This diff reverts D65167805, which broke the release pipeline.

Test Plan: NA

Differential Revision: D65245198

@diff-train-skip-merge (to silent facebook-github-bot until I have a stamp to land this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139371
Approved by: https://github.com/malfet
2024-10-31 07:25:28 +00:00
86e6513c86 [BE] Remove deprecated AT_DISPATCH_ALL_TYPES_AND_HALF (#139353)
It's been deprecated for 2 years now; time to delete it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139353
Approved by: https://github.com/ezyang
2024-10-31 07:06:19 +00:00
a7479fa282 TunableOp use dense size calculations as minimum sizes (#139137)
Fixes #139116.  Also fixes other unreported issues with torch.bmm due to incorrect size calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139137
Approved by: https://github.com/yoyoyocmu
2024-10-31 06:01:58 +00:00
261d90c18f Add docs page for torch.inf and torch.nan (#138430)
Fixes #131040

## Description
Add docs for `torch.inf` and `torch.nan`.
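
A quick illustration of the constants being documented (usage example only, not part of the docs change itself):

```python
import math
import torch

x = torch.tensor([torch.inf, -torch.inf, torch.nan])
print(torch.isinf(x))  # tensor([ True,  True, False])
print(torch.isnan(x))  # tensor([False, False,  True])
assert torch.inf == math.inf
```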

## Checklist
- [x] The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER")
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138430
Approved by: https://github.com/ezyang
2024-10-31 05:46:46 +00:00
cyy
f95c71867e [9/N] Fix extra warnings brought by clang-tidy-17 (#139286)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139286
Approved by: https://github.com/ezyang
2024-10-31 05:20:31 +00:00
42b5e191ae Fix the example of fx/interpreter (#139368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139368
Approved by: https://github.com/ezyang
2024-10-31 05:12:43 +00:00
d08dbd0436 Update torch-xpu-ops commit pin (#139041)
# Motivation
This PR intends to update torch-xpu-ops commit pin. It mainly includes the following two highlighted changes:
1. split the DLL library into 4 smaller libraries to avoid the 2G limitation on Windows;
2. some new operators added, for example, `cdist`, `pdist`, `maxunpool2d`, `maxunpool3d`, `upsample_trilinear3d`, Bessel operators, etc.

# Additional Context
We have to supply XPU device check logic in `cdist` and `pdist` ops.
This PR depends on https://github.com/pytorch/pytorch/pull/139050 to fix Windows build issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139041
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-10-31 05:06:06 +00:00
74b7fb9519 Add conjugate method on SymFloat (#139249)
Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_conjugate_num_type4_dynamic_shapes

when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139249
Approved by: https://github.com/ezyang
2024-10-31 04:55:36 +00:00
0cf4cc3d5f [fx] split_module subgraph should always have an output node (#139275)
Fixes https://github.com/pytorch/pytorch/issues/138207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139275
Approved by: https://github.com/ezyang
2024-10-31 04:53:19 +00:00
e3e3ab805b [fx graph cache] Refactor FxGraphCachePickler (#138682)
Summary: In an upcoming change, we need to modify FxGraphCachePickler to behave differently depending on whether the graph has frozen parameters. To do that, it will be convenient to change FxGraphCachePickler into a regular object instead of a collection of classmethods.

Test Plan: unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138682
Approved by: https://github.com/eellison
ghstack dependencies: #138681
2024-10-31 03:31:51 +00:00
cyy
70ba471957 [3/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139248)
Follows #139158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139248
Approved by: https://github.com/ezyang
2024-10-31 03:29:19 +00:00
cyy
1dd503c6fb [4/N] Fix Wextra-semi warning (#139256)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139256
Approved by: https://github.com/ezyang
2024-10-31 03:01:14 +00:00
bd88d40e5f [Submodule] update submodule onnx==1.17.0 (#139128)
Follow-up PR of: https://github.com/pytorch/pytorch/pull/138719

CC @malfet @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139128
Approved by: https://github.com/malfet
2024-10-31 02:50:00 +00:00
cyy
29297731bb [5/N] Don't skip ASAN on some tests (#139265)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139265
Approved by: https://github.com/ezyang
2024-10-31 02:49:03 +00:00
d7411c0cc1 [AOTI] add C shim for QConvPointWise (#138540)
This PR adds C shim for `QConvPointWisePT2E` and `QConvPointWiseBinaryPT2E` similar to https://github.com/pytorch/pytorch/pull/138439. Besides that, we aligned the implementation of `qconv_pointwise` with `qlinear_pointwise` in the following aspects:
1. The parameter orders of `qconv_pointwise` and `qlinear_pointwise` are quite different; we aligned the schema of `qconv_pointwise` to have a similar parameter order to `qlinear_pointwise` to make it more consistent.
2. We always converted `x_scale` and `x_zero_point` to Tensors, just like in the lowering of `qlinear_pointwise`. This avoids the need to create two separate C APIs (one for `double x_scale` and `int64_t x_zero_point`, and another for `Tensor` versions). Instead, we only need one API for `Tensor`-based `x_scale` and `x_zero_point`. If we later add dynamic quantization for qconv (which will use `Tensor` for `x_scale` and `x_zero_point`), we can reuse the code from this PR and don't need to change the C shim layer API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138540
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691, #138806
2024-10-31 02:03:01 +00:00
69ea2e726c Consolidate Triton cache into Inductor cache (#138239)
Summary:
This diff/PR attempts to consolidate Triton caching into the Inductor caching so that there can be just one cache that unifies them both, reducing network requests and increasing success rate.

Implementation details can be found by reading the code or the post: https://fb.workplace.com/groups/1553867532149891/posts/1605037517032892

I did not use the Autotune bundler code at all since I want to simplify that and merge it into this on the next diff/PR.

In terms of instrumentation
1) Dynamo compile: `triton_bundler_time_saved_s`, which is the sum over all triton.compile calls. We don't have to use the specific number; we can use it as a binary value.
2) Events table: I used dynamo_timed to measure how much time we spend on the bundler collect and write functions, which is all the work we do in this diff.
3) TLParse: I emitted number of kernels and triton_bundler_time_saved_s into tlparse as well

Test Plan: Updated unit tests

Adhoc running
```
TORCHINDUCTOR_BUNDLE_TRITON_INTO_FX_GRAPH_CACHE=1 buck2 run @mode/opt //scripts/oulgen:runner
```
gives
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpmTZt6b/0_0_0/fx_graph_cache_hit_4.json
<img width="771" alt="image" src="https://github.com/user-attachments/assets/478782a2-ee47-40cb-b723-fcac2bf9dd93">

Differential Revision: D64504909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138239
Approved by: https://github.com/ezyang
2024-10-31 01:37:16 +00:00
c7f1fccd7a Globally enable Python dispatcher for all of Inductor compilation (#137621)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137621
Approved by: https://github.com/eellison
2024-10-31 01:35:23 +00:00
289e03a429 Revert "Allow inplacing buffer when other users are inconsequential (#138383)"
This reverts commit 8840889c3f6565b7975150adebcbe062f19035ee.

Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2448824206))
2024-10-31 01:32:15 +00:00
38429938de [cond] make cond not throw warnings on constant pred in eager mode (#138837)
We don't raise warnings for torch.cond in eager mode the motivation is in  https://github.com/pytorch/pytorch/issues/138782.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138837
Approved by: https://github.com/zou3519
2024-10-31 01:13:19 +00:00
b90503d9ae [DCP] Unit Test to validate the stateful and non-stateful loads (#139251)
Summary: Unit test to validate the stateful and non-stateful loads. This test is a follow-up to the fix in [#138575](https://github.com/pytorch/pytorch/pull/138575), which addresses an issue in stateful dicts' in-place updates in distributed checkpoint loading. Also added additional code comments regarding the stateful and non-stateful loads.

Test Plan:
```
buck2 test //caffe2/test/distributed/checkpoint/e2e:test_e2e_save_and_load
```

https://www.internalfb.com/intern/testinfra/testrun/8162774562859797

Differential Revision: D65188659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139251
Approved by: https://github.com/LucasLLC, https://github.com/fegin
2024-10-31 01:12:51 +00:00
7ed0d69004 [ROCm] Increase hipBLASLt default workspace size (#139300)
This PR increases the hipBLASLt default workspace size to 76 MB, which is the recommended default. This PR does not contain any bug fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139300
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-10-31 00:56:54 +00:00
42d790bb65 Revert "Add conjugate method on SymFloat (#139249)"
This reverts commit bcf8a0124fbadb469f6766eb7555a75ea0fa9d43.

Reverted https://github.com/pytorch/pytorch/pull/139249 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the doc build failure is legit ([comment](https://github.com/pytorch/pytorch/pull/139249#issuecomment-2448755839))
2024-10-31 00:45:48 +00:00
4db6b740bc [Easy] GraphTransformObserver Refactoring (#139292)
Uses `torch._inductor.config.trace.log_url_for_graph_xform` by default as the log url. It was only ever instantiated with this as the log_url argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139292
Approved by: https://github.com/shengfukevin, https://github.com/shunting314
2024-10-31 00:33:28 +00:00
8fa0bc3358 Use cached dnnl::stream in GpuStreamManager (#139176)
# Motivation
The code changes in the `GpuStreamManager` class are intended to help manage `dnnl::stream` efficiently.

# Additional Context
Use the following code for a simple benchmark.
```python
import torch
import time

device = torch.device("xpu")

M, N, K = 64, 64, 64  # You can change these dimensions as needed
torch.manual_seed(0)

A = torch.randn(M, K, device=device)
B = torch.randn(K, N, device=device)

# Warm-up
for _ in range(10):
    torch.matmul(A, B)

s1 = torch.xpu.Stream()
s2 = torch.xpu.Stream()

# Measure the time for the GEMM operation
start_time = time.time()
with torch.xpu.stream(s1):
    for _ in range(50000):
        C = torch.matmul(A, B)

with torch.xpu.stream(s2):
    for _ in range(50000):
        D = torch.matmul(A, B)

torch.xpu.synchronize()
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

# Print the results
print(f"Time taken for GEMM operation: {elapsed_time:.6f} seconds")
```
Compared with the old implementation, which takes 2.077069s, the new implementation takes 2.023017s, which means a ~2% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139176
Approved by: https://github.com/gujinghui, https://github.com/jgong5
2024-10-31 00:23:39 +00:00
f81223938c support nesting of suppress_guards, suppress guards when generated compiled autograd graph (#138968)
Fixes https://github.com/pytorch/pytorch/issues/138920. See comments there for details.

I still need to try to get a smaller repro to write an actual test. But with the guards suppressed, I no longer see the specialization in the CA graph in the linked example:
```
        aot1_view_3: ... = torch.ops.aten.view.default(aot1_tangents_1, [aot1_sym_size_int, 48, 1])
        aot1_view_4: ... = torch.ops.aten.view.default(aot1_view_3, [aot1_sym_size_int, 48])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138968
Approved by: https://github.com/yf225, https://github.com/xmfan
2024-10-31 00:13:39 +00:00
cyy
d391ed3f4e Use static_assert to detect get_type_index used in device code (#139173)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139173
Approved by: https://github.com/r-barnes, https://github.com/ezyang
2024-10-31 00:06:53 +00:00
f747bd2947 Move slow test query to ClickHouse (#139322)
Example run: https://github.com/pytorch/pytorch/actions/runs/11602255032/job/32306827867?pr=139322 (pr creation commented out), also tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139322
Approved by: https://github.com/huydhn
2024-10-30 23:58:27 +00:00
48854cbfc4 Add missing operator and corresponding unittest (#138309)
Fixes https://github.com/pytorch/pytorch/issues/129690

Adds operator.neg and operator.pos to _SYM_BOOL_OPS.

Provides a simple unit test under export/test_serialize.py that reproduces the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138309
Approved by: https://github.com/ezyang, https://github.com/angelayi
2024-10-30 23:50:24 +00:00
f32b9a5145 Fx graph always return tuple in fuse_as_graphmodule (#139236)
Summary: As title.

Test Plan: Let's see what OSS CI says

Differential Revision: D65147426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139236
Approved by: https://github.com/ezyang
2024-10-30 23:31:06 +00:00
a494572799 Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:

1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization

This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:

1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
2024-10-30 23:28:25 +00:00
bcf8a0124f Add conjugate method on SymFloat (#139249)
Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_conjugate_num_type4_dynamic_shapes

when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139249
Approved by: https://github.com/ezyang
2024-10-30 23:28:09 +00:00
a426837f85 Don't set replacement if lhs is in the free symbols of the rhs (#139250)
Fixes python test/dynamo/test_functions.py FunctionTests.test_is_integer

when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139250
Approved by: https://github.com/ezyang
2024-10-30 23:21:30 +00:00
754b262bdb Move close_nonexistent_disable_issues.py queries to ClickHouse (#139296)
Example run: https://github.com/pytorch/pytorch/actions/runs/11601996563/job/32305991204?pr=139296 (commented out the part that actually closes issues but the queries run)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139296
Approved by: https://github.com/huydhn
2024-10-30 23:09:39 +00:00
ae6cbd4256 Block more keys from config serialization (#139285)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139285
Approved by: https://github.com/jovianjaison, https://github.com/markkm, https://github.com/c00w
2024-10-30 23:05:59 +00:00
4a8d12227e [Pipelining] add schedule simulator and chrometrace dump (#138134)
Schedule simulator is useful for detecting hangs in schedules and
validating that they won't hang.  It also inserts bubbles (None actions)
at any timestep where a rank can not enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency.  The output can be visualized.  The simulator expects a full
comm + compute schedule as input.

Chrometrace dump is a basic visualization utility.  It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134
Approved by: https://github.com/H-Huang
2024-10-30 23:00:58 +00:00
ec5fbee6c0 Revert "Drop caffe2 string_utils (#139217)"
This reverts commit 1797a2035d92d25d3dcc46fd8facdd6569b30c53.

Reverted https://github.com/pytorch/pytorch/pull/139217 on behalf of https://github.com/huydhn due to Chatting with @r-barnes, this is still used in lots of place internally ([comment](https://github.com/pytorch/pytorch/pull/139217#issuecomment-2448568071))
2024-10-30 22:23:32 +00:00
fef5e94657 addmm: error on output dtype mismatch. (#138520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138520
Approved by: https://github.com/ezyang
ghstack dependencies: #138515
2024-10-30 21:46:39 +00:00
6da3a043a8 Add test for consistency between meta and CPU devices. (#138515)
Reference: https://github.com/pytorch/pytorch/issues/138399

This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.

I have added XFAIL to the set of operations that are currently failing.
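
A hedged illustration of the behavior being checked (this is not the new OpInfo test itself; the op and dtypes are arbitrary examples):
```python
import torch

# Per the out= spec, writing a float result into an int64 output should error.
x_cpu = torch.randn(4)
out_cpu = torch.empty(4, dtype=torch.int64)
try:
    torch.sin(x_cpu, out=out_cpu)
except RuntimeError as e:
    print("cpu raised:", e)

# The new test asserts that meta inputs behave consistently with the CPU case.
x_meta = torch.randn(4, device="meta")
out_meta = torch.empty(4, dtype=torch.int64, device="meta")
try:
    torch.sin(x_meta, out=out_meta)
except RuntimeError as e:
    print("meta raised:", e)
```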
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
2024-10-30 21:46:39 +00:00
24c9683355 [mergebot] Add ci-no-td label on revert (#139218)
Just in case?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139218
Approved by: https://github.com/wdvr
2024-10-30 21:36:09 +00:00
8840889c3f Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
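
A rough sketch of the pattern this targets (illustrative only; shapes and names are made up):
```python
import torch

@torch.compile
def f(x, w, ln_weight, ln_bias):
    y = x @ w  # with this change, Inductor may reuse (inplace) y's buffer below
    return torch.nn.functional.layer_norm(y, y.shape[-1:], ln_weight, ln_bias)

x = torch.randn(64, 128)
w = torch.randn(128, 256)
out = f(x, w, torch.ones(256), torch.zeros(256))
```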

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-10-30 21:35:50 +00:00
ad0883a288 [real_tensor_prop] Infer Fake kernels during real tensor prop (#139213)
This PR changes real_tensor_prop to also infer fake kernels when the
operator doesn't have it.

We infer the fake output to be of the same properties as the real
output, with unbacked symints in the sizes and some stride order.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139213
Approved by: https://github.com/pianpwk
ghstack dependencies: #139212
2024-10-30 21:29:33 +00:00
03ec25053a [export] Update min_val and max_val to Optional[int] in serialization. (#139223)
Summary: Following the export team's discussion, we are upgrading min_val and max_val to optional fields, which shouldn't break BC and allows the schema to express infinity.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_serialize_infinite_sym_int

Differential Revision: D65167805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139223
Approved by: https://github.com/yiming0416
2024-10-30 21:14:17 +00:00
6d5944c9f1 turn off USE_MIMALLOC_ON_MKL temporary. (#139204)
Fixes #138994

We can turn off `USE_MIMALLOC_ON_MKL` temporarily, because it caused https://github.com/pytorch/pytorch/issues/138994

For a complete fix, we need to address the `USE_STATIC_MKL` lost-functionality issue: https://github.com/pytorch/pytorch/pull/138996, and then detect the correct MKL linking type (shared/static). That still needs some time to pass all CI and builder scripts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139204
Approved by: https://github.com/ezyang
2024-10-30 21:09:21 +00:00
05cb98f91d [TF32][Inductor] Account for TF32 in test_inductor_layout_optimization_input_mutations (#138948)
Tests using a conv2d kernel which can dispatch to a TF32-backed implementation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138948
Approved by: https://github.com/ezyang
2024-10-30 20:34:16 +00:00
77e25d57b0 Create ciflow/inductor-periodic (#138763)
This is related to https://github.com/pytorch/pytorch/issues/138476.  This would save about 1/8 of the total cost, not a big number, but still a save I guess.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138763
Approved by: https://github.com/desertfire
2024-10-30 19:59:44 +00:00
ef380f7b8e [real tensor prop] Add some asserts for custom ops (#139212)
When we see a custom op:
- check that its mutation annotations are correct
- check that its aliasing constraints matches our constraints for custom
  ops.

Otherwise, there may be undefined behavior.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139212
Approved by: https://github.com/angelayi
2024-10-30 19:29:11 +00:00
5c6d35482e [Inductor] Support Triton AttrsDescriptor cls field (#139193)
Fixes #139179

Adding corresponding changes to https://github.com/triton-lang/triton/pull/4888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139193
Approved by: https://github.com/bertmaher
2024-10-30 18:16:38 +00:00
180d283156 [export] avoid debug name crash for dim hints (#139104)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139104
Approved by: https://github.com/ezyang
2024-10-30 18:12:44 +00:00
7765d1ef70 Preliminary registered-buffer collective support via Inductor (#138029)
```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensure that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(communication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a necessary intermediate state before we tackle memory planning for comm buffers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029
Approved by: https://github.com/Chillee
ghstack dependencies: #138028
2024-10-30 18:11:09 +00:00
421473c234 get_symm_mem_workspace(): print helpful error during graph capture (#138028)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138028
Approved by: https://github.com/weifengpy
2024-10-30 18:11:09 +00:00
f4ab8b48c5 Allow schedules to run with single stage (#138925)
Ran into issues (https://github.com/pytorch/pytorch/pull/138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138925
Approved by: https://github.com/wconstab
2024-10-30 17:33:16 +00:00
ad637a4c5c Add support for index_put_ in NT (#135722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722
Approved by: https://github.com/jbschlosser
2024-10-30 17:17:59 +00:00
f14f245747 [export] Remove custom forward func in swap (#139126)
Differential Revision: [D65100694](https://our.internmc.facebook.com/intern/diff/D65100694)

Remove the custom forward function and instead move the pytree flatten/unflatten ops into the graph. This allows us to natively run via the interpreter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139126
Approved by: https://github.com/avikchaudhuri
2024-10-30 16:50:57 +00:00
4b83302585 [MPS] Update error message for supported autocast type (#139192)
Autocast in MPS currently only supports dtype of `torch.float16`. This PR updates the error message to reflect this.
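
A minimal sketch of the supported usage (assumes an MPS-enabled build):
```python
import torch

with torch.autocast(device_type="mps", dtype=torch.float16):
    a = torch.randn(8, 8, device="mps")
    b = torch.randn(8, 8, device="mps")
    c = a @ b  # runs under autocast in float16
# Requesting an unsupported dtype (e.g. torch.bfloat16) is what triggers the
# error message updated by this PR.
```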

This PR was created using [Copilot Workspace](https://copilot-workspace.githubnext.com/pytorch/pytorch/issues/139190?shareId=5b510fda-380c-4e86-8e91-6b67a078f180) with no human input other than clicking buttons.

Fixes #139190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139192
Approved by: https://github.com/malfet
2024-10-30 16:48:29 +00:00
996c40e85e Adjusted install_user script for Ubuntu 24.04 support (#138815)
Fixes #138812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138815
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2024-10-30 16:31:09 +00:00
29eb65fce8 Fix in-place state dict updates for distributed checkpoint loading (#138575)
`dcp.load()` is documented as "operating in place", updating the state of existing state_dict elements instead of replacing them wherever possible. However, it appears that in the case of a stateful element, the code both updates its state in-place, then replaces it with a copy of itself in the state_dict. This looks like a simple oversight, so here's a PR that should fix it!

[From the docs:](https://pytorch.org/docs/stable/distributed.checkpoint.html)
> DCP is different than torch.save and torch.load in a few significant ways: *...*
> - It operates in place, meaning that the model should allocate its data first and DCP uses that storage instead.

This manifested as a strange bug in TorchTitan, causing a model loaded from a checkpoint to be saved incorrectly, resulting in a twice-resumed model being subtly broken.
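
A minimal sketch of the in-place contract (the checkpoint path is hypothetical):
```python
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(4, 4)
# Allocate the state first; dcp.load() is expected to update these tensors
# in place rather than replace them with copies.
state_dict = {"model": model.state_dict()}
dcp.load(state_dict, checkpoint_id="path/to/checkpoint")
model.load_state_dict(state_dict["model"])
```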

Let me know if this makes sense, and if there's anything else I should add!

Thanks for all the work on PyTorch!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138575
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-10-30 16:10:24 +00:00
04eb15da44 [AOTI] Unify the default value of allow_stack_allocation (#139147)
Summary: Unify the default value of allow_stack_allocation for fbcode and OSS

Differential Revision: D65064673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139147
Approved by: https://github.com/hl475
2024-10-30 16:01:23 +00:00
6e85266a47 [MPS] Fixes SiLU on non-contiguous tensors (#139006)
Similar to #123049; `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous, prior to macOS 15.0.
Originally the problem was found in jy0205/Pyramid-Flow#113.
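
A hedged repro sketch of this class of bug (assumes an MPS device on an affected macOS version):
```python
import torch

x = torch.randn(8, 16, device="mps").transpose(0, 1)  # non-contiguous view
y = torch.nn.functional.silu(x)
y_ref = torch.nn.functional.silu(x.contiguous())
print(torch.allclose(y.cpu(), y_ref.cpu()))  # should be True after the fix
```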
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139006
Approved by: https://github.com/malfet
2024-10-30 15:44:59 +00:00
49bfbed2eb Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit 383eba522922f0b7c525b88ed4348c64b40b95cf.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to larger memory usage apparently not acceptable ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2447382819))
2024-10-30 14:43:15 +00:00
456c87c8a2 [8/N] Fix extra warnings brought by clang-tidy-17 (#139151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139151
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-30 14:20:08 +00:00
44257c063e [dynamo] Fix constant propagation in builtins and UserClasses (#131354)
* Fixes https://github.com/pytorch/pytorch/issues/118675
* Replaces https://github.com/pytorch/pytorch/pull/118994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-10-30 12:47:20 +00:00
a951d99e16 Revert "Move reduce to template parameter in vectorized_reduction (#138672)"
This reverts commit 9b2c99d731695b76205d617ddc1e799ba11ae1a0.

Reverted https://github.com/pytorch/pytorch/pull/138672 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138672#issuecomment-2446927015))
2024-10-30 12:12:13 +00:00
9bbe4a67ad [dynamo] support maxlen for collections.deque (#138194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138194
Approved by: https://github.com/jansel, https://github.com/malfet
2024-10-30 10:08:02 +00:00
a4b35767cb Don't have random print in convert_frame (#139203)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139203
Approved by: https://github.com/Skylion007
2024-10-30 09:35:37 +00:00
a19bdfb36e [compiled autograd] reorder backward hooks to match eager behavior (#138553)
Fixes #138538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138553
Approved by: https://github.com/xmfan
2024-10-30 08:46:45 +00:00
b71ab3fc85 [DTensor][Bug Fix]Fix 2D DTensor mm with mesh_shape (1, n) or (n, 1) (#139134)
Fixes #138742. In the issue, matrix multiplication with DTensor failed when one of the mesh dimensions has size 1 and the mesh is > 1D. We were missing tests covering this corner case where mesh_shape is (n, 1) or (1, n). The DTensor mm op is correct when the 1D mesh has shape (self.world_size,) or when it is a 2D mesh in which no mesh dimension has size 1.

In this PR, we fix the corner case by updating `gen_einsum_strategies` in `_einsum_strategy.py`. Specifically, we cannot skip generating `mesh_dim_strategies` when `mesh_dim <= 1`, as this is not valid for an nD mesh in which one of the mesh dimension sizes is 1.

Without the fix, the OpStrategy generated for a 2D mesh with mesh_shape (1, n) or (n, 1) is wrong, as the generated OpStrategy is 1D.

```
all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]]
OpStrategy(all_strategies):::   [(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)[(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)
```

After the fix, we can see the OpStrategy generated is correct with 2D strategy.
```
all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]][[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]]
OpStrategy(all_strategies) = [(RR, RR) -> RR, (RS(1), RS(0)) -> RP, (RS(0), RR) -> RS(0), (RR, RS(1)) -> RS(1), (S(1)R, S(0)R) -> PR, (S(1)S(1), S(0)S(0)) -> PP, (S(1)S(0), S(0)R) -> PS(0), (S(1)R, S(0)S(1)) -> PS(1), (S(0)R, RR) -> S(0)R, (S(0)S(1), RS(0)) -> S(0)P, (S(0)S(0), RR) -> S(0)S(0), (S(0)R, RS(1)) -> S(0)S(1), (RR, S(1)R) -> S(1)R, (RS(1), S(1)S(0)) -> S(1)P, (RS(0), S(1)R) -> S(1)S(0), (RR, S(1)S(1)) -> S(1)S(1)] @ mesh: (4, 1)
```

*******
As a follow-up, we should add more test coverage for DTensor ops with a 2D mesh, including a 2D mesh where one of the mesh dimension sizes is 1.
*******
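
A hedged sketch of the corner case (assumes 4 ranks, an initialized process group, and the public `torch.distributed.tensor` API; the placements are illustrative):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

mesh = init_device_mesh("cuda", (4, 1), mesh_dim_names=("dp", "tp"))  # one dim has size 1
a = distribute_tensor(torch.randn(16, 16), mesh, [Shard(0), Replicate()])
b = distribute_tensor(torch.randn(16, 16), mesh, [Replicate(), Shard(1)])
c = a @ b  # previously failed because the generated OpStrategy was only 1D
```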

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139134
Approved by: https://github.com/fegin
2024-10-30 08:09:39 +00:00
ceab24def4 [CI] Unify numpy version for python-3.9 and 3.10 configs (#139244)
Per dependabot, numpy-1.21 is subject to CVE-2021-34141, so perhaps it's ok not to test against it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139244
Approved by: https://github.com/huydhn
2024-10-30 06:47:38 +00:00
3495ef78a2 Unbreak fp16 dot issues caused by #137917 (#139262)
See comment for explanation. In short, doing the fixup in float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139262
Approved by: https://github.com/huydhn
2024-10-30 05:10:19 +00:00
cyy
4e5f9afc7f Enable c10::sv and std::sv constexpr conversions (#139239)
A small step towards moving c10::sv to std::sv; this tiny change shouldn't break Meta builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139239
Approved by: https://github.com/malfet
2024-10-30 03:57:47 +00:00
cd8f7730f4 [PT2E][Quant] Remove Redundant Method in X86 Quantizer (#139161)
**Summary**
Remove the redundant methods of X86 Inductor Quantizer: `get_supported_quantization_configs`, `get_supported_operator_for_quantization_config`, and `get_supported_operators`. They are not required to implement a customized Quantizer and are not mentioned in the existing documentation for how to use X86 Inductor Quantizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139161
Approved by: https://github.com/jgong5
2024-10-30 03:31:17 +00:00
edcab61f93 Skip test for PT2E quantized ops in fbcode (#138792)
Skip these tests as they are failing in fbcode.
Submitting this PR per request from @jerryzh168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138792
Approved by: https://github.com/jerryzh168
2024-10-30 02:37:38 +00:00
eqy
b4e4f84a06 Fix regex in test_static_inputs_address_mutation_log for Python 3.12 (#139229)
Otherwise Python 3.12's `re` seems to be unhappy with `re.error: global flags not at the start of the expression at position 113`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139229
Approved by: https://github.com/ezyang
2024-10-30 02:36:31 +00:00
cyy
b0f84aad5d [3/N] Fix Wextra-semi warnings (#139165)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139165
Approved by: https://github.com/ezyang
2024-10-30 02:08:13 +00:00
5861279f47 Revert "Add support for index_put_ in NT (#135722)"
This reverts commit b4836e5b5ce2891e9af21790d255720e2dbf8e91.

Reverted https://github.com/pytorch/pytorch/pull/135722 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/135722#issuecomment-2445651914))
2024-10-30 01:53:55 +00:00
1797a2035d Drop caffe2 string_utils (#139217)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139217
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2024-10-30 01:13:16 +00:00
cyy
da1c1a9884 [4/N] Don't skip ASAN on some tests (#139189)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139189
Approved by: https://github.com/ezyang
2024-10-30 00:59:32 +00:00
ba40dc19d2 [CI] Run aarch64 build/tests on every trunk commit (#139228)
As we have sccache now, this should be reasonably fast.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139228
Approved by: https://github.com/kit1980
2024-10-30 00:49:06 +00:00
f643499ddd Fix vec128_half_neon.h compilation with GCC (#139235)
`mask` is already defined as `uint16x8_t` no need to reinterpret it
bd369bb182/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h (L220)

Fixes
```
var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h: In static member function 'static at::vec::DEFAULT::Vectorized<c10::Half> at::vec::DEFAULT::Vectorized<c10::Half>::set(const at::vec::DEFAULT::Vectorized<c10::Half>&, const at::vec::DEFAULT::Vectorized<c10::Half>&, int64_t)':
/var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h:227:39: error: cannot convert 'uint16x8_t' to 'float16x8_t'
  227 |                 vreinterpretq_u16_f16(mask),
      |                                       ^~~~
      |                                       |
      |                                       uint16x8_t
In file included from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/intrinsics.h:23,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec.h:6,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.h:2,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:5841:36: note:   initializing argument 1 of 'uint16x8_t vreinterpretq_u16_f16(float16x8_t)'
 5841 | vreinterpretq_u16_f16 (float16x8_t __a)
      |                        ~~~~~~~~~~~~^~~
```

introduced by https://github.com/pytorch/pytorch/pull/137911

Also, guard any use of NEON intrinsics in `ReducedPrecisionFloatGemvFastPathKernel.cpp` with `!defined(CPU_CAPABILITY_SVE)` otherwise compilation fails with
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::VectorizedN<c10::Half, 16>&)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:77:24: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
   77 |   return vaddvq_f32(t0 + t1);
      |                     ~~~^~~~
      |                        |
      |                        at::vec::SVE256::Vectorized<float>
In file included from /var/lib/jenkins/workspace/c10/util/Half.h:51,
                 from /var/lib/jenkins/workspace/c10/util/Float8_e5m2.h:17,
                 from /var/lib/jenkins/workspace/c10/core/ScalarType.h:8,
                 from /var/lib/jenkins/workspace/c10/core/TensorImpl.h:11,
                 from /var/lib/jenkins/workspace/c10/core/GeneratorImpl.h:8,
                 from /var/lib/jenkins/workspace/aten/src/ATen/core/Generator.h:18,
                 from /var/lib/jenkins/workspace/aten/src/ATen/CPUGeneratorImpl.h:3,
                 from /var/lib/jenkins/workspace/aten/src/ATen/Context.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:2,
                 from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:10423:25: note:   initializing argument 1 of 'float32_t vaddvq_f32(float32x4_t)'
10423 | vaddvq_f32 (float32x4_t __a)
      |             ~~~~~~~~~~~~^~~
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::Vectorized<float>)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:119:21: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
  119 |   return vaddvq_f32(x);
      |                     ^
      |                     |
      |                     at::vec::SVE256::Vectorized<float>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139235
Approved by: https://github.com/huydhn
2024-10-30 00:48:57 +00:00
d9e87fb339 [draft-export] Include guards for constraint violation errors (#138748)
Summary:
Added where logs are being added to constrain violations in draft export.

Example output:
```
1. Constraint violation error.
    The specified input dynamic_shapes spec was found to be incorrect during tracing.
    Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
    This occured at the following stacktrace:
        File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1736, in _wrapped_call_impl
        File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1747, in _call_impl
        File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/scripts/angelayi/draft_export/test_draft_export.py, lineno 138, in forward.
    Because of this, we have modified the dynamic shapes structure to be the following:
    ```
    dynamic_shapes = {'a': {0: 3}}
    ```
```

The result of this diff is also that `dynamic` logs are permanently turned on during draft export. Otherwise we cannot capture the `[guard added]` logs from symbolic_shapes.py.

Test Plan: `buck2 run @//mode/dev-nosan scripts/angelayi/draft_export:test_draft_export -- -r "test_shape_failure" `

Differential Revision: D64862374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138748
Approved by: https://github.com/ezyang
2024-10-30 00:24:17 +00:00
b4836e5b5c Add support for index_put_ in NT (#135722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722
Approved by: https://github.com/jbschlosser
2024-10-30 00:03:21 +00:00
341a28f0ce Refactors empty_cache to return only MemPool memory to the system (#133602)
Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool.

In this PR, we extend the functionality of empty_cache API such that it only releases the cached blocks of an active pool. When empty_cache API is called under a MemPoolContext, we only release the cached blocks that correspond to the pool id of the active pool.
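
A hedged sketch of the intended usage (the `MemPool` / `use_mem_pool` names are assumptions based on the linked issue and may differ from the final API):
```python
import torch

pool = torch.cuda.MemPool()  # assumed API name
with torch.cuda.use_mem_pool(pool):
    x = torch.randn(1024, 1024, device="cuda")
del x  # the freed block is now cached inside `pool`

# Calling empty_cache under the pool's context should release only this
# pool's cached blocks back to the system, not the allocator's global cache.
with torch.cuda.use_mem_pool(pool):
    torch.cuda.empty_cache()
```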

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602
Approved by: https://github.com/ezyang
2024-10-29 23:58:44 +00:00
bd369bb182 Workaround torch.deploy failures (#139195)
Summary:
These environments are backed by an older version of `typing_extensions`, but this runtime could not care less about type-checking.
So pretend that it has `TypeIs` by replacing it with `TypeGuard`.

Fixes test failures introduced by https://github.com/pytorch/pytorch/pull/133814 / D65030974

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//multipy/runtime:test_deploy -- --exact 'multipy/runtime:test_deploy - TorchpyTest.TestNumpy'`

Differential Revision: D65145409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139195
Approved by: https://github.com/Skylion007
2024-10-29 23:36:16 +00:00
fcb36a69cd [ONNX] Add a test file for _building.py (#139107)
Fixes #138761

Adds a test file for _building.py to verify and guarantee the correct behavior of OpRecorder. Note that the tests do not validate the model itself, but the expected behavior of the evaluator adding extra ops during input preprocessing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139107
Approved by: https://github.com/justinchuby
2024-10-29 23:25:31 +00:00
a0e095dd9f config: Modify install_config_module to use a layered approach (#138758)
This modifies the config system to use a single mapping of config ->
ConfigEntry and to store the default and user values within each entry.

We could have used multiple dicts (i.e. user_override and default), but
as we add more fields (justknobs in this PR, perhaps testing and env
variables later), it quickly becomes painful.

There are a couple design decisions we could change.
1) All configs we save store the resolved value - not the default and
   user override separately
2) All configs we load apply the resolved value as a user override.

This means that certain complexities of default behaviour and deletion
(as well as JK) will change if you save + load a config.
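
An illustrative sketch of the layered idea (the `ConfigEntry` name comes from this PR's description; the fields and the `resolve()` helper are assumptions, not the actual implementation):
```python
from dataclasses import dataclass
from typing import Any

_UNSET = object()

@dataclass
class ConfigEntry:
    default: Any
    user_override: Any = _UNSET

    def resolve(self) -> Any:
        # a user override wins; otherwise fall back to the default
        return self.default if self.user_override is _UNSET else self.user_override

configs = {"some_flag": ConfigEntry(default=False)}
configs["some_flag"].user_override = True
print(configs["some_flag"].resolve())  # True
```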

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138758
Approved by: https://github.com/ezyang
2024-10-29 23:19:36 +00:00
46d0b635b9 [CMake] Remove pthread linking (#134436)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134436
Approved by: https://github.com/r-barnes
2024-10-29 23:14:40 +00:00
eqy
c9bd712305 [CUDA][AMP] Speed up fp16/bf16 casts on H100+ (#137053)
Similar to #110251, we're seeing cases where vectorization can benefit casts to fp16/bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137053
Approved by: https://github.com/drisspg
2024-10-29 23:01:16 +00:00
b29c170bee [PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (#137917)
Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137917
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916
2024-10-29 22:38:01 +00:00
fc2d0da773 [PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (#137916)
The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
2024-10-29 22:38:01 +00:00
5be1556d4a [PyTorch] Clean up Registers/ElementsPerIteration constants (#137915)
In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914
2024-10-29 22:37:49 +00:00
aafbea49b9 [PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ (#137914)
This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts
fulfilling request from @malfet to split the ARM64 fast path stuff
into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913
2024-10-29 22:37:37 +00:00
6502d6cf17 [PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures (#137913)
float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137913
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912
2024-10-29 22:37:30 +00:00
9ede4b2746 [PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized (#137912)
Migrated as much as possible and convenient; focusing on fp16
for now. (This is building toward enabling these fast paths on x86 for
machines without AVX-512fp16/bf16 to fix
https://github.com/pytorch/torchchat/issues/1253 .)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137912
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911
2024-10-29 22:37:24 +00:00
41d7471413 [PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available (#137911)
We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137911
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #137661
2024-10-29 22:37:17 +00:00
837538f040 [PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert (#137661)
NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137661
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-29 22:37:10 +00:00
23d590e518 More flexible test parametrization with @reparametrize (#138369)
**Background:** The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example:
```python
class MyTestClass(TestCase):
    @parametrize("x", range(3))
    @parametrize("y", [False, True])
    def test_foo(self, x, y):
        # Invoked with:
        # x=0, y=False
        # x=1, y=False
        # x=2, y=False
        # x=0, y=True
        # x=1, y=True
        # x=2, y=True
        ...
```

Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info *according to what is supported for each op*.
```python
class MyTestClass(TestCase):
    @ops(op_db)
    def test_foo(self, op, device, dtype):
        # Invoked each OpInfo in the db along with each device / dtype that corresponds
        # with this op according to the OpInfo entry.
        ...
```

Note that this is in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately.

This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. Examples:
```python
# adapter_fn that adds a new arg
 def include_is_even_arg(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    new_param_kwargs = dict(param_kwargs)
    new_param_kwargs["is_even"] = is_even
    is_even_suffix = "_even" if is_even else "_odd"
    new_test_name = f"{test_name}{is_even_suffix}"
    yield (new_test_name, new_param_kwargs)

# adapter_fn that excludes certain values
def exclude_odds(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    yield None if not is_even else (test_name, param_kwargs)

class MyTestClass(TestCase):
    @reparametrize(parametrize("x", range(5)), include_is_even_arg)
    def test_foo(self, x, is_even):
        # Invoked with both the x value and the new is_even arg
        ...

    @reparametrize(parametrize("x", range(5)), exclude_odds)
    def test_bar(self, x):
        # Only invoked with even x values
        ...
```

For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369
Approved by: https://github.com/janeyx99
2024-10-29 22:14:38 +00:00
ebaa774f96 Migrate inductor and torchbench workflows to start experimenting with a100 on aws (#139079)
Excluding nightly workflows, as they are more critical and run less frequently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139079
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/huydhn
2024-10-29 22:11:25 +00:00
80c7c7178e Make sure all SDPA tests are ran with tensor cores enabled (#135592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135592
Approved by: https://github.com/eqy
2024-10-29 20:53:10 +00:00
c81d4fd0a8 Upgrade sccache to v0.8.2 for CPU targets (#121323)
This essentially reverts https://github.com/pytorch/pytorch/pull/95997 but switches to builds from source to official mozilla's sccache repo for CPU builds, except PCH one, see https://github.com/pytorch/pytorch/issues/139188
- Define `SCCACHE_REGION` for the jobs that need it.
- Enable aarch64 builds to use sccache, which allows one to do incremental rebuilds under 10 min, see https://github.com/pytorch/pytorch/actions/runs/11565944328/job/32197278296

Fixes https://github.com/pytorch/pytorch/issues/121559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121323
Approved by: https://github.com/atalman
2024-10-29 19:54:36 +00:00
2b577ae58f Implement NJT embedding backward (#138627)
Fixes #138352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138627
Approved by: https://github.com/jbschlosser
2024-10-29 18:44:58 +00:00
a884462bca Add workspace to TritonTemplates (#138050)
Here's a markdown summary for the PR:

# Add workspace buffer support for Triton templates

## Summary
Adds support for templates to allocate and use temporary workspace buffers

## Key Changes
- Add `WorkspaceArg` support in Triton template system
- Automatic workspace allocation/deallocation around kernel execution
- Zero-initialization support for workspace buffers
- Seamless integration with existing tensor management

## Example Usage
```python
def generate(self, ...):
    workspace_arg = WorkspaceArg(
        count=1024*1024,  # 1MB workspace
        zero_fill=True    # Zero-initialized
    )

    return TritonTemplateCaller(..., workspace_arg=workspace_arg)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138050
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-10-29 18:17:54 +00:00
7964bcc3dc [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945)
**Summary**
This PR fixes a calculation error in DeviceMesh's create_sub_mesh().

**Error Description**
When users call `device_mesh["dim0", "dim1", "dim2", "dim3"]`, it creates a mesh slice, which we call a "submesh". Users can also slice a submesh from a flattened mesh. For example:
```
flattened_mesh = device_mesh["dim0", "dim1", "dim2"]._flatten("dim0-2")
alias_flattened_mesh = device_mesh["dim0-2"]  # this mesh slice leads to error in current impl
```

This triggers an error in the size calculation `reduce(lambda, mesh_dim)` in `create_sub_mesh`:
```
IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4)
```

**Fix**
The usage of the lambda is wrong: for `lambda x, y`, `x` is the accumulated value while `y` is the current element.
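
A small illustration of the reduce() semantics behind the fix (not the actual DeviceMesh code):
```python
from functools import reduce

mesh_dim_sizes = [4, 1]  # e.g. sizes spanned by a flattened "dim0-2"
# In `lambda acc, size`, the first argument is the accumulator and the second
# is the current element; swapping their roles indexes with the wrong value.
submesh_size = reduce(lambda acc, size: acc * size, mesh_dim_sizes, 1)
print(submesh_size)  # 4
```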

**Test**
`pytest test/distributed/test_device_mesh.py -s -k test_flatten_mesh_4d`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138945
Approved by: https://github.com/wz337
2024-10-29 17:56:56 +00:00
cyy
82a6d2db3f [2/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139158)
Follows #139007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139158
Approved by: https://github.com/Skylion007
2024-10-29 17:16:37 +00:00
c98c88a211 [Bugfix] UnicodeDecodeError: 'utf-8' codec can't decode byte (#139062)
Fixes #113564

When I used PyTorch's profiler to analyze the performance of vLLM, I encountered the following error. This error is similar to #113564. After analysis and troubleshooting, I changed the temporary file from text mode to binary mode, and it no longer reported an error and ran normally.

```bash
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 722, in stop
ERROR 10-28 10:25:50 engine.py:160]     self._transit_action(self.current_action, None)
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 751, in _transit_action
ERROR 10-28 10:25:50 engine.py:160]     action()
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 745, in _trace_ready
ERROR 10-28 10:25:50 engine.py:160]     self.on_trace_ready(self)
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 444, in handler_fn
ERROR 10-28 10:25:50 engine.py:160]     prof.export_chrome_trace(os.path.join(dir_name, file_name))
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 220, in export_chrome_trace
ERROR 10-28 10:25:50 engine.py:160]     fout.writelines(fin)
ERROR 10-28 10:25:50 engine.py:160]   File "<frozen codecs>", line 322, in decode
ERROR 10-28 10:25:50 engine.py:160] UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 5896: invalid start byte
```
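
A minimal sketch of the fix (file names are illustrative): copying the temporary trace in binary mode avoids decoding bytes that are not valid UTF-8.
```python
src, dst = "trace_tmp.json", "trace.json"  # illustrative paths
with open(src, "rb") as fin, open(dst, "wb") as fout:
    fout.writelines(fin)
```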
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139062
Approved by: https://github.com/ezyang
2024-10-29 17:16:26 +00:00
68134a320e [Flex Attention] Paged Attention (#137164)
This PR adds paged attention for flex attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137164
Approved by: https://github.com/drisspg
2024-10-29 17:05:22 +00:00
cyy
3907f36808 Turn some variables and functions into static (#136847)
Re-check some files and mark variables and functions into static and fix other warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136847
Approved by: https://github.com/ezyang
2024-10-29 17:01:56 +00:00
3f9f6048da [aoti] Print output name for sympy.Expr as well (#138524)
To avoid
```
NotImplementedError: unsupported type of output=s0*s1
```

It seems like this was caused by the use of `_scaled_dot_product_flash_attention`.

Fallback kernel:
```
FallbackKernel(
  python_kernel_name='torch.ops.aten._scaled_dot_product_flash_attention.default',
  name=buf55,
  layout=MultiOutputLayout(device=device(type='cuda', index=0)),
  inputs=[ComputedBuffer(name='buf52', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0*s1, 64], stride=[384*s0*s1, 64*s0*s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99da20>, ranges=[1, 6, s0*s1, 64])), ComputedBuffer(name='buf53', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0*s1, 64], stride=[384*s0*s1, 64*s0*s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99d480>, ranges=[1, 6, s0*s1, 64])), ComputedBuffer(name='buf54', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0*s1, 64], stride=[384*s0*s1, 64*s0*s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99c430>, ranges=[1, 6, s0*s1, 64]))],
  constant_args=(0.125,),
  kwargs={'scale': 0.125},
  output_view=None,
  python_kernel_name=torch.ops.aten._scaled_dot_product_flash_attention.default,
  cpp_kernel_name=at::_ops::_scaled_dot_product_flash_attention::call,
  ordered_kwargs_for_cpp_kernel=['scale'],
  op_overload=aten._scaled_dot_product_flash_attention.default,
  arg_properties=[{'name': 'query', 'type': Tensor, 'default_value': None}, {'name': 'key', 'type': Tensor, 'default_value': None}, {'name': 'value', 'type': Tensor, 'default_value': None}, {'name': 'dropout_p', 'type': float, 'default_value': 0.0}, {'name': 'is_causal', 'type': bool, 'default_value': False}, {'name': 'return_debug_mask', 'type': bool, 'default_value': False}],
  kwarg_properties=None,
  unbacked_bindings=None,
  mutation_outputs=[],
  origin_node=None,
  origins=OrderedSet([_scaled_dot_product_flash_attention])
)
```

Codegen with this PR:
```
// Topologically Sorted Source Nodes: [scaled_dot_product_attention], Original ATen: [aten._scaled_dot_product_flash_attention]
    double var_147 = 0.125;
    AtenTensorHandle buf56_handle;
    AtenTensorHandle buf57_handle;
    auto buf55_4 = s0*s1;
    auto buf55_5 = s0*s1;
    AtenTensorHandle buf58_handle;
    AtenTensorHandle buf59_handle;
    AtenTensorHandle buf60_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cuda__scaled_dot_product_flash_attention(convert_arrayref_tensor_to_tensor(buf52), convert_arrayref_tensor_to_tensor(buf53), convert_arrayref_tensor_to_tensor(buf54), 0.0, 0, 0, &var_147, &buf56_handle, &buf57_handle, nullptr, nullptr, &buf55_4, &buf55_5, &buf58_handle, &buf59_handle, &buf60_handle));
    RAIIAtenTensorHandle buf56(buf56_handle);
    RAIIAtenTensorHandle buf57(buf57_handle);
    RAIIAtenTensorHandle buf58(buf58_handle);
    RAIIAtenTensorHandle buf59(buf59_handle);
    RAIIAtenTensorHandle buf60(buf60_handle);
```

Differential Revision: D64724460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138524
Approved by: https://github.com/chenyang78
2024-10-29 16:02:45 +00:00
a762dc0357 [inductor] Multi-kernel + cooperative reductions (#138893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138893
Approved by: https://github.com/shunting314
ghstack dependencies: #138533
2024-10-29 15:45:17 +00:00
77b0ae832d [inductor] Allow cooperative + persistent reductions (#138533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138533
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-10-29 15:45:17 +00:00
9d7a0869f0 Make DDP Quantization hooks backend Agnostic (#138816)
The current DDP quantization hooks code uses the .cuda() API to move tensors and parameters onto backend devices. This limits DDP quantization hooks to the CUDA backend only.
The change makes the code backend-agnostic and moves tensors/parameters based on **tensor.device**.
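
An illustrative sketch of the change (not the exact diff from the PR; the helper name is made up):
```python
import torch

def move_like(reference: torch.Tensor, param: torch.Tensor) -> torch.Tensor:
    # before: param.cuda()  -- hard-wired to the CUDA backend
    # after: follow the device of the tensor being processed
    return param.to(reference.device)
```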

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138816
Approved by: https://github.com/kwen2501
2024-10-29 15:02:45 +00:00
869d1ad0b4 [BE] Nested namespace in quantized folder (#139166)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139166
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-29 14:53:07 +00:00
489c66fdb3 [AOTI] fix pointer_to_list (#138806)
Fixes the `pointer_to_list` function to take `*(ptr + i)` instead of `*ptr`.
This fixes the runtime error when running INT8 yolo-v7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138806
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691
2024-10-29 14:33:16 +00:00
9af1816974 [AOTI] add C shim for _weight_int8pack_mm (#138691)
Fixes the error of running WOQ-INT8 LLaMA:
```
E           In file included from /home/user/inductor/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3,
E                            from /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:4:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp: In function ‘void inductor_entry_impl(AtenTensorOpaque**, AtenTensorOpaque**)’:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:117:33: error: ‘aoti_torch_cpu__weight_int8pack_mm’ was not declared in this scope
E             117 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int8pack_mm(convert_arrayref_tensor_to_tensor(arg8_1), _frozen_param0, _frozen_param1, &buf0_handle));
E                 |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138691
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-29 13:53:36 +00:00
69d401d010 Update test_quantize_pt2e.py with HPU support (#137863)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

**CHANGES**
- Add support for HPU devices within the test_move_exported_model_bn using TEST_HPU flag
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137863
Approved by: https://github.com/jerryzh168
2024-10-29 13:01:03 +00:00
b9618c9b88 [Dynamo] Add itertools.compress() support (#139061)
Use polyfill to add `itertools.compress()` support in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139061
Approved by: https://github.com/jansel
2024-10-29 10:25:55 +00:00
cyy
e201460f8a [2/N] Fix Wextra-semi warnings (#139142)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139142
Approved by: https://github.com/ezyang
2024-10-29 08:14:37 +00:00
93d7f90c3a [inductor] getting AOT inductor to treat None args correctly (#139114)
Differential Revision: [D65102228](https://our.internmc.facebook.com/intern/diff/D65102228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139114
Approved by: https://github.com/aakhundov
2024-10-29 08:11:53 +00:00
8b08559c80 Move more workflows to 3.9 (#139145)
Specifically mergebot and others should be using 3.9 now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139145
Approved by: https://github.com/kit1980, https://github.com/Skylion007, https://github.com/huydhn
2024-10-29 05:39:46 +00:00
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bdd0789b0ea9b9348d552fb1b0e437ff.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
ea93e09896 [CI] Align XPU CI build with CD to fix build issue (#139050)
Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139050
Approved by: https://github.com/ezyang
2024-10-29 04:53:53 +00:00
e52ccb3ca6 [Device] Replace hardcoded devices with 'torch._C._get_accelerator()' (#139032)
I noticed some hard-coded device selection like `"cuda" if torch.cuda.is_available() else "cpu"`, which can be replaced with `torch._C._get_accelerator()`.
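
A minimal before/after sketch of the replacement pattern; the exact fallback behavior when no accelerator is present is an assumption here, not taken from this PR:

```python
import torch

# Before: device = "cuda" if torch.cuda.is_available() else "cpu"
device = torch._C._get_accelerator()  # assumed to act as a drop-in for the hard-coded device string
x = torch.ones(4, device=device)
print(x.device)
```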

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139032
Approved by: https://github.com/ezyang
2024-10-29 04:51:47 +00:00
cyy
a0865b00fb [1/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139007)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139007
Approved by: https://github.com/ezyang
2024-10-29 04:48:13 +00:00
cyy
0274d16c01 Fix clang-tidy warnings in jit code (#138974)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138974
Approved by: https://github.com/ezyang
2024-10-29 04:33:40 +00:00
48b55ca1b1 [export] Fix non-strict retracing with kwargs (#138927)
Summary:
`torch.fx.Interpreter.run()` only takes args as input. Currently we pass kwargs as well which causes errors during retracing.

Flattening the kwargs and concatenating them with the args solves the issue.

Several previously failing tests under `_retraceability_non_strict` now pass.
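
A hedged sketch of the idea; the flattening/ordering details below are illustrative assumptions, not the actual export code:

```python
import torch
import torch.fx

def add(x, bias):
    return x + bias

gm = torch.fx.symbolic_trace(add)

def run_with_kwargs(interp: torch.fx.Interpreter, args, kwargs):
    # Interpreter.run() only accepts positional args, so flatten the kwargs and
    # append them to args in placeholder order (assumed to line up here).
    return interp.run(*args, *kwargs.values())

print(run_with_kwargs(torch.fx.Interpreter(gm), (torch.ones(2),), {"bias": torch.ones(2)}))
```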

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability_non_strict
```

Differential Revision: D64980053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138927
Approved by: https://github.com/angelayi
2024-10-29 04:31:21 +00:00
3342b533bb Update setuptool to 72.1.0 (#139144)
As older versions are affected by CVE-2024-6345

Also, update `typing_extensions` to 4.11 to support `TypeIs`, otherwise some of the workflows report the following error (but somehow still succeed), see [this](https://github.com/pytorch/pytorch/actions/runs/11566785190/job/32196549021):
```
2024-10-29T03:55:01.3601410Z + /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_11566785190 --no-capture-output python3 -c 'import torch'
2024-10-29T03:55:01.3602260Z ~/runner/_work/_temp ~/runner/_work/pytorch/pytorch
2024-10-29T03:55:01.8043630Z Traceback (most recent call last):
2024-10-29T03:55:01.8044540Z   File "<string>", line 1, in <module>
2024-10-29T03:55:01.8045670Z   File "/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/torch/__init__.py", line 37, in <module>
2024-10-29T03:55:01.8046690Z     from typing_extensions import ParamSpec as _ParamSpec, TypeIs as _TypeIs
2024-10-29T03:55:01.8048010Z ImportError: cannot import name 'TypeIs' from 'typing_extensions' (/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/typing_extensions.py)
```
Also delete macOS-X86 as we no longer build those

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139144
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/huydhn
2024-10-29 04:24:51 +00:00
61d0686168 [PyTorch] Use intrusive_ptr(p, DontIncreaseRefcount) directly in TensorBase unsafe borrow ctor (#138934)
We observed ASAN failures stemming from 5ea6777861/torch/csrc/autograd/python_variable.cpp (L403) . Since it's possible that `tensor` is dead here, `borrowed()` needs to avoid dereferencing it. `intrusive_ptr::reclaim` dereferences the pointer in builds with debug checks enabled, so use the DontIncreaseRefcount ctor directly instead.

Differential Revision: [D64990707](https://our.internmc.facebook.com/intern/diff/D64990707/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138934
Approved by: https://github.com/ezyang
2024-10-29 04:20:11 +00:00
6aef58a249 Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit c066f4a055020ae994dd10a1b1fafbe3774108cd.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the test is failing in trunk, maybe a landrace? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2443158194))
2024-10-29 04:08:11 +00:00
4ee514144b [c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for the SparseArch all_to_all but want to wait for them in the compiled region at the beginning of the OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

----

**Update**: Did two items to prevent regression to existing use cases:

1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case
2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users).

The risk of this new version of PR causing regression should be very low.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-29 03:31:19 +00:00
cyy
d8f99f39cb Avoid unnecessary tensor constructions (#139039)
Because Variable is an alias of Tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139039
Approved by: https://github.com/Skylion007
2024-10-29 02:23:23 +00:00
e80fe7f13a [dynamo][guards] Skip guards on empty nn module hooks (#138942)
This brings some unsoundness in guards. Earlier we were skipping the empty nn module hooks dict guard only on inbuilt nn modules, but as seen in https://github.com/pytorch/pytorch/issues/138386, there could still be significant guard overhead. With this PR, we reduce the guard eval latency from 420 us to 280 us (1.5x reduction).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138942
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #139040, #138954
2024-10-29 02:11:47 +00:00
2aa5348356 [dynamo][guards] Skip no tensor aliasing guards on parameters (#138954)
This is another unsound guard eval optimization. It's rare in practice to
compile a function with two different parameters as inputs, and then
later call the function with one parameter input as two different inputs
(aliasing). This further reduces guard overhead from 280 us to 240 us
for the model in https://github.com/pytorch/pytorch/issues/138386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138954
Approved by: https://github.com/jansel
ghstack dependencies: #139040
2024-10-29 02:11:47 +00:00
dee7e715ba [dynamo][refactor] Remaining cleanup from config-cleanup of enable_cpp_guard_manager (#139040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139040
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-29 02:11:39 +00:00
7c7b2d89ba [ROCm] set hipblas workspace (#138791)
Fixes #138532.

This brings hipblas behavior in line with cublas behavior with respect to setting the workspace to an allocation from the caching allocator as well as the env var HIPBLAS_WORKSPACE_CONFIG.
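
A hedged usage sketch; the value format is an assumption borrowed from the CUBLAS_WORKSPACE_CONFIG convention, not taken from this PR:

```python
import os

# Assumed ":<size KiB>:<count>" format, mirroring CUBLAS_WORKSPACE_CONFIG;
# set before the first hipBLASLt call so the workspace size takes effect.
os.environ.setdefault("HIPBLAS_WORKSPACE_CONFIG", ":4096:2")

import torch  # imported after setting the env var on purpose

if torch.cuda.is_available():
    a = torch.randn(64, 64, device="cuda", dtype=torch.float16)
    print((a @ a).shape)
```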

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791
Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-29 01:37:55 +00:00
eqy
07b0d633b8 [cuDNN][SDPA] Bail out of cuDNN SDPA for seqlen 1 inputs (#138531)
Forwarded #138529 to the cuDNN team, but for now we want to avoid dispatching to unsupported cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138531
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-29 01:03:36 +00:00
1637a40796 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.

In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-29 01:01:47 +00:00
c066f4a055 Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32, which led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.
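
A hedged sketch of the contrast, using `torch.baddbmm` for illustration (not the actual Inductor decomposition code):

```python
import torch

def baddbmm_decomp_upcast(inp, b1, b2, beta=1.0, alpha=1.0):
    # old decomposition behavior (roughly): compute everything in fp32, downcast at the end
    out = beta * inp.float() + alpha * torch.bmm(b1.float(), b2.float())
    return out.to(inp.dtype)

def baddbmm_eager(inp, b1, b2, beta=1.0, alpha=1.0):
    # eager keeps the bmm in fp16; only the accumulation/epilogue runs in higher precision
    return torch.baddbmm(inp, b1, b2, beta=beta, alpha=alpha)

inp = torch.randn(2, 3, 4, dtype=torch.float16)
b1 = torch.randn(2, 3, 5, dtype=torch.float16)
b2 = torch.randn(2, 5, 4, dtype=torch.float16)
print((baddbmm_decomp_upcast(inp, b1, b2) - baddbmm_eager(inp, b1, b2)).abs().max())
```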

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-29 00:54:29 +00:00
2b937e4e6d [inductor] Cooperative reductions (#137756)
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
2024-10-29 00:45:53 +00:00
cyy
383d9e3de6 [4/N] Fix cppcoreguidelines-special-member-functions warnings (#139027)
Follows #138796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139027
Approved by: https://github.com/ezyang
2024-10-29 00:18:18 +00:00
5b39734a0a [DTensor][Test] Fix gloo backend failure when eager_init is turned on (#139097)
We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error:
```
RuntimeError: No backend for the parent process group or its backend does not support splitting
```

This also fixes an issue where a test failure was not asserted when using `with_comms()` or `with_comms(eager_init=False)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097
Approved by: https://github.com/XilunWu
2024-10-29 00:04:06 +00:00
cyy
aa2b17c330 [3/N] Don't skip ASAN on some tests (#139058)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139058
Approved by: https://github.com/ezyang
2024-10-28 23:57:23 +00:00
cyy
5ab81099e3 [2/N] Fix object slice (#139036)
Follows #138880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139036
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-10-28 23:56:36 +00:00
e00ead400c Add a temporary Survey about the search (#139096)
- Add a link to the new search survey
- Add .css classes needed for the search banner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139096
Approved by: https://github.com/seemethere, https://github.com/cjyabraham
2024-10-28 23:43:25 +00:00
ab09c4d913 Add host-side TMA support to AOTInductor (#138878)
This adds host-side Triton TMA support to AOTInductor. Notes:

- Two helper functions, `init1DTMADescriptor` and `init2DTMADescriptor` are added to the C++ wrapper codegen on GPU, conditioned on the model having user-defined Triton kernels with host-side TMA (CUDA-specific).
- C++ wrapper codegen on GPU emits TMA descriptor initialization via the aforementioned helper functions.
- Special handling added for the TMA descriptors (in the Python wrapper codegen) during the compile-time autotuning, as the underlying tensor can't be passed directly to the user-defined Triton kernel. TMA descriptors are generated in-between the source tensor's buffer and the kernel call, like in the full Python wrapper codegen.
- This PR concludes the host-side Triton TMA support in PT2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138878
Approved by: https://github.com/desertfire, https://github.com/chenyang78
ghstack dependencies: #138759, #138877
2024-10-28 23:39:53 +00:00
fd9f4e6770 Back out "[compiled autograd] tls access helpers (#138061)" and Back out "[compiled autograd] Compiled autograd configs in TLS (#137821)" (#139086)
Summary:
Original commit changeset: 9bf80c1492d7

Original Phabricator Diff: D64796226

Original commit changeset: aa1d9ef8f6e6

Original Phabricator Diff: D64796212

Differential Revision: D65072644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139086
Approved by: https://github.com/malfet
2024-10-28 23:37:05 +00:00
18ad44e830 [BE] Test collect env against torch-2.* (#139122)
And also update Python version to 3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139122
Approved by: https://github.com/kit1980
2024-10-28 23:17:38 +00:00
ba749755f5 Bump rexml from 3.3.3 to 3.3.9 in /ios/TestApp (#139088)
Bumps [rexml](https://github.com/ruby/rexml) from 3.3.3 to 3.3.9.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.3.3...v3.3.9)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 15:47:10 -07:00
23fb8baf37 Bump certifi from 2024.2.2 to 2024.7.4 in /tools/build/bazel (#130173)
Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4.
- [Commits](https://github.com/certifi/python-certifi/compare/2024.02.02...2024.07.04)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 15:44:49 -07:00
b7524b05d2 Make test_export training IR compatible (#138517)
In this PR, I make test_export compatible with the training IR. The idea is that when we flip the IR to the non-functional training IR, all these tests should be green. The changes involve reading through each test case and adding the necessary decompositions, etc., to make sure the tests pass. For example, if the tests expect to see mutated buffers returned, we need to get them by running run_decomp.

Differential Revision: [D64732360](https://our.internmc.facebook.com/intern/diff/D64732360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138517
Approved by: https://github.com/avikchaudhuri
2024-10-28 22:38:19 +00:00
904816d1ed [dynamo] handle 3.13.0 __dict__ watcher bug (#138284)
https://github.com/python/cpython/pull/116115 introduced a bug (https://github.com/python/cpython/issues/125608) where changing the attributes of an object may not fire the dict watchers registered to the object's `__dict__`. It has been fixed by https://github.com/python/cpython/pull/125611 but will only be in 3.13.1+.

This PR disables the dict watcher guard shortcut for `__dict__`s on 3.13.0 and warns the user to try using 3.13.1+ instead. We also added a simple test to check for this functionality in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138284
Approved by: https://github.com/jansel
ghstack dependencies: #138030
2024-10-28 22:25:21 +00:00
35be6aef69 [dynamo] add some cpython debugging methods (#138030)
This PR enables you to inspect PyObjects in C using `INSPECT(...)` without requiring https://docs.python.org/3/howto/gdb_helpers.html. `torch._dynamo.eval_frame.raise_sigtrap` can also be used to set gdb breakpoints while running Python code, e.g.

```python
x = x + 1
torch._dynamo.eval_frame.raise_sigtrap();
# can breakpoint on ceval.c:CALL to breakpoint the `sin` call in C.
x = torch.sin(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138030
Approved by: https://github.com/jansel
2024-10-28 22:25:21 +00:00
edf2a1be97 [ROCm][CK] Explicit cast values to half (#138751)
Addresses ambiguous conversions and calls introduced by these two pull requests:
[[ROCm] CK-based GEMM](https://github.com/pytorch/pytorch/pull/131004)
[[AMD] Fix torch ck backend build with 6.2.1](https://github.com/pytorch/pytorch/pull/138434)

Co-authored-by: cjatin <cjatin@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138751
Approved by: https://github.com/jeffdaily

Co-authored-by: pruthvistony <pruthvigithub@gmail.com>
Co-authored-by: cjatin <cjatin@users.noreply.github.com>
2024-10-28 22:00:26 +00:00
ded83d2b16 support torch._utils._flatten_dense_tensors/_unflatten_dense_tensors … (#139023)
Fixes #138897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139023
Approved by: https://github.com/ezyang
2024-10-28 21:59:07 +00:00
8785353f2f Fix tensor subclass + dynamic shapes in torch.compile + aot autograd (#125941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125941
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133337
2024-10-28 21:58:59 +00:00
6baccb430b Update TwoTensor impl. to accept outer_size/outer_stride (#133337)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133337
Approved by: https://github.com/bdhirsh
2024-10-28 21:58:59 +00:00
cyy
f4f0f2995d Fix Wextra-semi warnings (#139000)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139000
Approved by: https://github.com/ezyang
2024-10-28 21:48:51 +00:00
52c80f663d change name of dynamo CI chard to dynamo_wrapped (#138233)
Implements https://github.com/pytorch/pytorch/issues/118127
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138233
Approved by: https://github.com/clee2000
2024-10-28 21:42:33 +00:00
02339e674d Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013)"
This reverts commit 74878ac271feecfa3ff3d32f78c7d889bcac97d6.

Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](74878ac271) ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2442667605))
2024-10-28 21:30:28 +00:00
1a275fea4b Remove numpy dependency for maia serialization (#137600)
See rationale in #137444 description

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137600
Approved by: https://github.com/albanD
2024-10-28 20:57:35 +00:00
dd688099af Update unbacked symints in torch.nonzero more precisely (#137663)
### Summary
The fake impl for `nonzero` sets the symint's upper range to `sys.maxsize - 1` if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.

See https://github.com/pytorch/pytorch/pull/134899 as a merged solution for a similar problem for a different op.
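
A minimal sketch of the bound computation being described (the exact formula is an illustrative assumption):

```python
import math

def nonzero_upper_bound(dim_upper_bounds):
    # the number of nonzero elements can never exceed the number of elements,
    # i.e. the product of the per-dimension upper bounds
    return math.prod(dim_upper_bounds)

print(nonzero_upper_bound([4, 8]))  # 32, instead of sys.maxsize - 1
```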

### Test plan
Added unit test to verify upper bound reduction calculation (`python test/export/test_export.py TestExport.test_nonzero_dynamic`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137663
Approved by: https://github.com/ezyang
2024-10-28 20:57:23 +00:00
8fa0479dd8 [inductor] Enable cpp wrapper for test_torchinductor (#138579)
Summary: Expand cpp wrapper testing to test_torchinductor. Using skip_cpp_wrapper to skip failing tests for now, and fixes are coming later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138579
Approved by: https://github.com/chenyang78, https://github.com/benjaminglass1
2024-10-28 20:35:25 +00:00
e5595f10c8 Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)"
This reverts commit a688c57033b4536ef59356cdad241d65ca52a869.

Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/yf225 due to Seems to have bad interaction with latest commits on trunk, reverting to be safe ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2442527696))
2024-10-28 20:13:46 +00:00
8ba9063002 FlexAttention support for NJT (#136792)
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
    * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
      * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.

Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```

TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
    * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
2024-10-28 20:01:27 +00:00
4cd985a886 [dynamo] Remove some files from dynamo_expected_failures (#138935)
Some tests in `test/dynamo` are marked as "expected failure" when testing
with `PYTORCH_TEST_WITH_DYNAMO=1`, i.e., we added files with those test
names to the `dynamo_expected_failures` folder.

However, a lot of those dynamo tests seem to be passing with
`PYTORCH_TEST_WITH_DYNAMO=1`, so this patch removes them from
`dynamo_expected_failures`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138935
Approved by: https://github.com/anijain2305
2024-10-28 19:41:26 +00:00
9e06b5b5cb fix unflatten with HOPs (#138978)
Summary:
Unflatten was broken for HOPs for a couple of reasons:
(1) we didn't expect `get_attr` nodes in the exported program, but they can occur to hold graph arguments to HOPs; such attributes must be moved from the exported program to the corresponding unflattened submodule containing the HOP call.
(2) we don't record metadata for graph arguments on serialization (there's nothing to hold it in our schema), and accordingly the `get_attr` nodes we create on deserialization don't have `nn_module_stack` metadata, which obviously wrecks unflatten.

Test Plan: added a couple of tests

Differential Revision: D65013647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138978
Approved by: https://github.com/zhxchen17
2024-10-28 19:30:56 +00:00
c2ded9ec0d Fix dot reference checks (#138596)
dot reference implementation should be consistent with the cpu / cuda implementations since it may be used for meta dispatch

i.e.
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```

However the below does not raise an exception
```python
x.to("meta").dot(y.to("meta"))
```
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
2024-10-28 19:11:40 +00:00
068f7e7a78 torch::optional -> std::optional (#138987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138987
Approved by: https://github.com/Skylion007
2024-10-28 19:09:46 +00:00
228963ad60 Revert "Add test for consistency between meta and CPU devices. (#138515)"
This reverts commit 006130d8eae834d17e3d3e21e61c506740cce6dc.

Reverted https://github.com/pytorch/pytorch/pull/138515 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is failing in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/138515#issuecomment-2442357471))
2024-10-28 18:45:09 +00:00
f466df63a9 [torch] Address -Wreturn-type warning when compiling for AMD (#138951)
Summary: Yep yep see title

Test Plan: CI

Differential Revision: D64971115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138951
Approved by: https://github.com/cyyever, https://github.com/adamomainz
2024-10-28 18:26:40 +00:00
817e57f832 Remove Python 3.8 from README (#139089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139089
Approved by: https://github.com/clee2000, https://github.com/malfet
2024-10-28 18:12:11 +00:00
475ba1df8d Explicitly avoid recording when should_record_events is false in record_shapeenv_event (#138965)
Looking at the function record_shapeenv_event, it's hard to tell that it does not always run;
we disable it by setting the top-level is_recording to True when self.should_record_events is false.
This change makes that explicit, to avoid confusion and overloading is_recording.

Alternatively, we could rename is_recording to do_not_record.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138965
Approved by: https://github.com/ezyang
ghstack dependencies: #138804
2024-10-28 18:12:06 +00:00
a688c57033 [c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for the SparseArch all_to_all but want to wait for them in the compiled region at the beginning of the OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-28 18:11:23 +00:00
5c49db98b4 [EZ] Update minversion to 3.9.0 (#139085)
Fixes https://github.com/pytorch/pytorch/issues/138979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139085
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/seemethere, https://github.com/Skylion007
2024-10-28 18:04:29 +00:00
74878ac271 [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug has slipped through until now because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-10-28 18:03:25 +00:00
fb2c750e9d [AOTI][refactor] Move convert_arrayref_tensor_to_tensor logic (#139030)
Summary: Move convert_arrayref_tensor_to_tensor codegen logic to cpp_wrapper_cpu_array_ref.py

Test Plan: CI

Differential Revision: D64904187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139030
Approved by: https://github.com/hl475
2024-10-28 18:00:41 +00:00
949fdd2997 remove redundant a (#139046)
As per title, only one "a" is sufficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139046
Approved by: https://github.com/Skylion007
2024-10-28 17:47:24 +00:00
66a3c249ae Linter for no workflows on fork (#138849)
Minor: adds a linter that ensures that all jobs run on pull_request, schedule, push, etc. have an `if: github.repository_owner == 'pytorch'` check or depend on a job that has that check

There is also a setting in Github repos that can disable all workflows for that repo

A lot of these are unnecessary because many jobs use reusable workflows that have that check.  However, this is a one time change so I'm not that bothered

Unfortunately I can't put this at the workflow level, which would make this better

Lots of weird string parsing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138849
Approved by: https://github.com/malfet
2024-10-28 17:46:50 +00:00
01b055abe3 Make masked_scatter core aten (#137949)
Summary: Making `masked_scatter` core aten since it is hard to decompose and we now have a portable kernel for it

Test Plan: N/A

Differential Revision: D64368725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137949
Approved by: https://github.com/larryliu0820
2024-10-28 17:31:53 +00:00
bca696ae81 Switch times to us in CompilationMetrics and improvements (#138975)
Companion logger diff: https://www.internalfb.com/diff/D65012523

* Using float seconds for timestamps is bad because our internal system defaults to float32 precision and you don't even get second precision for timestamps in float32
* We decided to use microseconds instead of milliseconds because with millisecond granularity you can end up with the same timestamp if compilation happens very quickly; it is much better to force non-overlapping spans
* Because there are so many new fields and I don't feel like reimplementing each on BwdCompilationMetrics, BwdCompilationMetrics is no more, it's just that everything in CompilationMetrics is now optional.
* The actual frame compile times collection is not modified (still float) to reduce blast radius, so I just convert to microseconds before making the record. At float64 precision (Python's default), you get about microsecond precision on timestamps so shouldn't be a data problem (https://www.leebutterman.com/2021/02/01/store-your-unix-epoch-times-as-float64.html)
* I rename some entries for clarity. In particular, whenever a timing contains all of its lower phases (e.g., how Inductor also contains Triton compilation) we put "cumulative" in its name. If something doesn't happen at compile time but is delayed until we have actual real inputs, we put "runtime" in its name.

Test plan:

```
buck2 run @mode/opt @mode/inplace //scripts/oulgen:runner
```

And then inspect https://fburl.com/scuba/dynamo_compile/sandbox/mslu7f5w and verify the us columns are populated and meaningful.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138975
Approved by: https://github.com/masnesral
2024-10-28 17:17:18 +00:00
cyy
9b2c99d731 Move reduce to template parameter in vectorized_reduction (#138672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138672
Approved by: https://github.com/soulitzer
2024-10-28 17:13:12 +00:00
3685c630b8 [pytorch] Plumb compile context from dynamo.export to aot_compile (#138793)
Summary:
tlparse shows unknown for certain items when _export.aot_compile() passes the graph obtained from dynamo.export() to inductor.aot_compile(); we also do not have access to the dynamo trace in the GraphModule exported by dynamo.

This change plumbs through the compile_context into aot_compile as a part of GraphModule.meta without a major change to APIs within dynamo.

Addresses issue: https://github.com/pytorch/pytorch/issues/123759?fbclid=IwY2xjawGE0LBleHRuA2FlbQIxMQABHS-PRpxvsrsHCDPdStHpqr1jQvx1YOnrPsRAfYAb-oXkU8MxidkIUENY-Q_aem_MAT2oaOgD03C8ggBNm575Q#issuecomment-2430722505

Test Plan:
```
buck2 test mode/opt //caffe2/test/dynamo:test_dynamo
Buck UI: https://www.internalfb.com/buck2/ad64c267-65be-47cf-a94f-e4b26e6e030b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9288674286334710
Network: Up: 83KiB  Down: 314KiB  (reSessionID-1dad223b-c91d-4718-97a4-bb2c81e480f0)
Jobs completed: 10750. Time elapsed: 19:18.5s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 5365. Fail 2. Fatal 0. Skip 4. Build failure 0

buck2 test mode/opt //caffe2/test/dynamo:test_dynamo_fb
Buck UI: https://www.internalfb.com/buck2/179a60bb-34e1-43b3-97ad-91af8a93ab01
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275046340687
Network: Up: 201KiB  Down: 1.8GiB  (reSessionID-36f33983-6d78-4ec9-aa1b-34cee80dcb4f)
Jobs completed: 17. Time elapsed: 42.9s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxZGXf6/index.html
Repro fixed: https://github.com/pytorch/pytorch/issues/123759?fbclid=IwY2xjawGE0LBleHRuA2FlbQIxMQABHS-PRpxvsrsHCDPdStHpqr1jQvx1YOnrPsRAfYAb-oXkU8MxidkIUENY-Q_aem_MAT2oaOgD03C8ggBNm575Q#issuecomment-2430722505

Differential Revision: D64863946

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138793
Approved by: https://github.com/ezyang
2024-10-28 17:07:44 +00:00
91ded0576d Add sym_log2 (#137980)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1515595595745313/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137980
Approved by: https://github.com/bobrenjc93
2024-10-28 17:03:14 +00:00
006130d8ea Add test for consistency between meta and CPU devices. (#138515)
Reference: https://github.com/pytorch/pytorch/issues/138399

This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.

I have added XFAIL to the set of operations that are currently failing.
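
A hedged illustration of the kind of consistency being tested; the op choice here is an assumption, not from the test itself:

```python
import torch

x = torch.randn(3)
out_wrong = torch.empty(3, dtype=torch.int64)
try:
    torch.sin(x, out=out_wrong)  # CPU: errors, float result can't be cast to an int64 out
except RuntimeError as e:
    print("cpu raised:", e)

x_meta = torch.randn(3, device="meta")
out_meta = torch.empty(3, dtype=torch.int64, device="meta")
try:
    torch.sin(x_meta, out=out_meta)  # the test checks that meta behaves the same way
except RuntimeError as e:
    print("meta raised:", e)
```
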
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
2024-10-28 16:58:48 +00:00
4dd04db5d0 Revert "[Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643)"
This reverts commit 4d92d6e60436b1aeffbf4dfce51f16923505251b.

Reverted https://github.com/pytorch/pytorch/pull/138643 on behalf of https://github.com/wdvr due to reverting due to a large number of internal failures, see below ([comment](https://github.com/pytorch/pytorch/pull/138643#issuecomment-2442036958))
2024-10-28 16:18:38 +00:00
d90717e4e2 Add option to save real tensors in TORCH_COMPILE_DEBUG repro (#138110)
This PR adds a utility that tries to construct the corresponding real tensor values of fake tensors by checking whether their meta storage is contained in the meta converter.

Then, we are able to save real tensor values for fx_graph_runnable if `TORCH_COMPILE_DEBUG_SAVE_REAL=1` is set.

Differential Revision: [D64502744](https://our.internmc.facebook.com/intern/diff/D64502744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138110
Approved by: https://github.com/ezyang
2024-10-28 16:18:22 +00:00
2922b9fee1 [ROCm] Fix ADDMM hipBLASLt regression (#138267)
Fixes #138067

A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604

The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267
Approved by: https://github.com/eqy, https://github.com/jeffdaily
2024-10-28 16:07:11 +00:00
ad933578ed [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-28 15:23:56 +00:00
3b0f39336c Revert "Adds snapshot API for MemPools to get pool memory segments (#133601)"
This reverts commit 00504aa6b8b0ae68761b89f023184202e8c79bc8.

Reverted https://github.com/pytorch/pytorch/pull/133601 on behalf of https://github.com/wdvr due to reverting for now as this breaks lots of internal tests. Details below ([comment](https://github.com/pytorch/pytorch/pull/133601#issuecomment-2441864871))
2024-10-28 15:12:20 +00:00
5916def695 Fix MKL status check wrong to MKLDNN. (#139049)
Fix a status check that wrongly used MKLDNN where MKL was intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139049
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-10-28 14:28:56 +00:00
4d8090cabb Avoid file encoding issues when loading cpp extensions (#138565)
I've found that when using `torch.utils.cpp_extension.load` on my Windows system, decoding errors occur when my .cpp/.cu files contain certain non-English characters.

`test.py`:
```py
from torch.utils.cpp_extension import load
my_lib = load(name='my_cuda_kernel', sources=['my_cuda_kernel.cu'], extra_cuda_cflags=['-O2', '-std=c++17'])
# ......
```

`my_cuda_kernel.cu`:
```cpp
#include <torch/types.h>
#include <torch/extension.h>
// 向量化 <------ some chinese characters

// ......
```

Errors will be reported as:
```
Traceback (most recent call last):
  File "E:\test\test.py", line 8, in <module>
    my_lib = load(
                 ^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1314, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1680, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 46, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 17, in hash_source_files
    hash_value = update_hash(hash_value, file.read())
                                         ^^^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0x96 in position 141: illegal multibyte sequence
```

The issue lies in the fact that the `open()` function in Python is platform-dependent, which can cause decoding errors when a file contains characters that are not supported by the default encoding. Pytorch uses file contents to generate hash string:
60c1433041/torch/utils/_cpp_extension_versioner.py (L16-L17)

On my Windows machine the default encoding is `gbk`, but all of my cpp files are `utf-8` encoded.

There is a simple solution to this problem I think: just change the file reading mode to binary mode, which can avoid issues related to file encoding. It works perfectly on my computer.

```diff
- with open(filename) as file:
+ with open(filename, 'rb') as file:
    hash_value = update_hash(hash_value, file.read())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138565
Approved by: https://github.com/malfet, https://github.com/janeyx99

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-28 14:06:34 +00:00
cyy
1ec76dd1dc Enable clang-tidy on torch/csrc/distributed (#139043)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139043
Approved by: https://github.com/Skylion007
2024-10-28 13:56:54 +00:00
60d1c7138d Revert "[inductor] Cooperative reductions (#137756)"
This reverts commit fed37dbfbceefe306af648ff4fe1e0124c4d7844.

Reverted https://github.com/pytorch/pytorch/pull/137756 on behalf of https://github.com/jeanschmidt due to ROCM tests are timing out :( ([comment](https://github.com/pytorch/pytorch/pull/137756#issuecomment-2441579322))
2024-10-28 13:24:33 +00:00
2487a834a4 Revert "Add sym_log2 (#137980)"
This reverts commit 5d450d7facd7480482132408acc4c23d80933bab.

Reverted https://github.com/pytorch/pytorch/pull/137980 on behalf of https://github.com/jeanschmidt due to lint broke from this onwards on main ([comment](https://github.com/pytorch/pytorch/pull/137980#issuecomment-2441570186))
2024-10-28 13:21:08 +00:00
8274dadac5 Make OpaqueUnaryFn pickleable (#138395)
Fixes https://github.com/pytorch/pytorch/issues/138070

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138395
Approved by: https://github.com/XuehaiPan, https://github.com/bobrenjc93
2024-10-28 13:10:04 +00:00
cyy
4d9b5a87e4 [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796
Approved by: https://github.com/ezyang
2024-10-28 10:53:11 +00:00
2265c2d48c Add pytorch.wait_counter.actual_codegen_and_compile WaitCounter (#138010)
The current pytorch.wait_counter.codegen_and_compile scopes over
cache hit/miss, so it doesn't accurately say if you're actually
spending time doing Inductor compile or not.  This counter /only/
is triggered when we're actually about to spend time in Inductor.
It covers Inductor lowering, codegen as well as Triton compilation.
It does NOT cover Triton compilation that occurs when you cache hit.

Some more bikeshedding may be needed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138010
Approved by: https://github.com/markkm
2024-10-28 08:06:24 +00:00
46132dc026 [Dynamo] Refactor wrap_fx_proxy (#138933)
During the work to dedup graphs for hierarchical compilation I tried to tame the `wrap_fx_proxy_cls` mess by separating the wrapping into three distinct scenarios (vs. a jumble of conditionals). These are:
1) wrapping a preexisting tensor (`_wrap_fx_preexisting_tensor`)
2) wrapping and tracing a new op into the graph (`_wrap_fx_proxy`)
3) handling a value that is some other proxyable data structure

See `wrap_fx_proxy_cls` for the conditional tree handling these three cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138933
Approved by: https://github.com/williamwen42
2024-10-28 08:05:33 +00:00
9ca749d6cd Revert " [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)"
This reverts commit 7cb3cef05f4b1d1b448a82a01420e2a9ed1ccfe0.

Reverted https://github.com/pytorch/pytorch/pull/138796 on behalf of https://github.com/wdvr due to reverting since this started failing a windows test ([comment](https://github.com/pytorch/pytorch/pull/138796#issuecomment-2440710865))
2024-10-28 07:06:00 +00:00
633dcf1a2d Constant folding for lifted graph (#135060)
Summary:
The current implementation for lifted graphs takes a dict of [constant name: constant value]. The constant values are used by run_node to execute the constant graph, get the folded values, and then create new getattr nodes for the folded values.

We don't have constant values for the lifted graph during model compilation on MTIA. I think it is more general to allow the constant folding pass to take only the constant names to produce the constant graph, and to represent the folded nodes as placeholders so it stays consistent with the lifted graph. Additionally, this mimics the real situation on Sigmoid, where Sigmoid executes the constant graph, gets the folded values, and sets them on the main graph. This diff updates the pass to work with a list of constant names.

Test Plan:
```
buck run mode/opt caffe2/test:test_export -- -r split_const_gm
```

Differential Revision: D62144791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135060
Approved by: https://github.com/SherlockNoMad

Co-authored-by: Tuan Trieu <tuant@meta.com>
2024-10-28 06:28:31 +00:00
a99e8eeb97 Propagate real tensor tracing with torchbind + fixing side effects (#138797)
Summary:
* Fixed real tensor tracing w/ torchbind objs by passing the cloned tensor obj. For now I just catch the exception and have an error message if the `_clone` fails, but up for discussion on what to do here
  * Separate question, should we require people to set up FakeScriptObjects and stuff for draft mode?
* Prevent side effects from happening when we do the first pass of custom ops profiling by cloning/copying everything. Not sure if deepcopying the model will succeed in all cases... But also I guess this path can be removed once custom ops profiling turns into one pass.

Test Plan: `buck2 run @//mode/dev-nosan //scripts/angelayi/draft_export:test_draft_export`

Reviewed By: ydwu4

Differential Revision: D64124825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138797
Approved by: https://github.com/ydwu4
2024-10-28 06:27:36 +00:00
dd9ff9f139 [compiled autograd] add tests for bwd hooks relative firing order (#139004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139004
Approved by: https://github.com/yf225
ghstack dependencies: #139003
2024-10-28 05:55:56 +00:00
fac74687a6 [compiled autograd] fix node origin graph comments (#139003)
the comment update was done after prehooks were already collected, so prehooks would appear as part of the previous node

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139003
Approved by: https://github.com/yf225
2024-10-28 05:55:56 +00:00
cyy
f9ae3fac8c [Distributed] [19/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138903)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138903
Approved by: https://github.com/ezyang
2024-10-28 05:29:25 +00:00
cyy
39aa3cb8d6 Re-enable skipped ubsan tests (#139008)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139008
Approved by: https://github.com/ezyang
2024-10-28 05:21:31 +00:00
d2052ea84d Update test_multiarray.py to support numpy 2.0+ (#138461)
Import _core instead of core.

Addresses partially #137182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138461
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-28 04:30:50 +00:00
4c6ae39afd Fix some nits in symbolic_shapes.py (#139018)
While I was reading through this file for understanding, I fixed some nits.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139018
Approved by: https://github.com/ezyang
2024-10-28 04:27:12 +00:00
1fad37a023 [audio hash update] update the pinned audio hash (#138402)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138402
Approved by: https://github.com/pytorchbot
2024-10-28 04:04:28 +00:00
6f5d538972 [executorch hash update] update the pinned executorch hash (#138661)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138661
Approved by: https://github.com/pytorchbot
2024-10-28 03:44:00 +00:00
d72241d045 [Ez][BE]: Fix one more incorrect TypeIs (#139010)
One other case where the side conditions could cause inaccurate typing info. Follow up to #138990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139010
Approved by: https://github.com/malfet
2024-10-28 03:36:45 +00:00
cyy
f7dc13806e [2/N] Don't skip ASAN on some tests (#138663)
Follows #138571
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138663
Approved by: https://github.com/ezyang
2024-10-28 03:35:57 +00:00
5d450d7fac Add sym_log2 (#137980)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1515595595745313/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137980
Approved by: https://github.com/bobrenjc93
2024-10-28 03:09:11 +00:00
c056dc4cb8 In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)
Title + we avoid calling defer_assert when we statically know the guard results.
timing for pnasnet5large

```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches without the diff
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
2024-10-28 02:19:55 +00:00
7cb3cef05f [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796
Approved by: https://github.com/ezyang
2024-10-28 01:38:02 +00:00
cyy
d2ec289787 Turn header static function into inline (#138671)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138671
Approved by: https://github.com/ezyang
2024-10-27 20:07:39 +00:00
192385e261 Add sym_sum to TorchInGraphFunctionVariable (#138848)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138848
Approved by: https://github.com/Skylion007
2024-10-27 20:04:35 +00:00
beb15c80fb print USE_STATIC_MKL for further debug. (#138902)
Print `USE_STATIC_MKL` for further debugging.
<img width="257" alt="image" src="https://github.com/user-attachments/assets/cd45bada-c28a-441a-b271-35956cfe1f21">
If we use `MKL`, this shows how it is linked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138902
Approved by: https://github.com/ezyang
2024-10-27 18:08:30 +00:00
652a2ab93e [BE] Skip print(foo) tests (#139009)
Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output.
Modified `test_print_non_contiguous` to get the tensors' string representation, which is an equivalent operation.
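
A minimal illustration of the equivalence (an assumed example, not the test itself):

```python
import torch

x = torch.arange(6).reshape(2, 3).t()  # non-contiguous tensor
s = str(x)                             # exercises the same formatting path as print(x)
assert "tensor" in s
```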

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009
Approved by: https://github.com/Skylion007
2024-10-27 18:04:03 +00:00
ee11e2da1e [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective and we will block there waiting for the comm to be ready -- the same effect as blocking init, with no "opening" compared to eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
2024-10-27 17:40:43 +00:00
fed37dbfbc [inductor] Cooperative reductions (#137756)
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
ghstack dependencies: #138970
2024-10-27 16:31:38 +00:00
3217ae2082 [inductor] Only apply score_fusion_memory_threshold to horizontal fusions (#138970)
PR #136782 made `x.sum()+1` become two kernels, which hurts compile
times as @ezyang noticed and breaks a lot of the tests in this stack.  This reworks that heuristic to not apply as often.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138970
Approved by: https://github.com/shunting314
2024-10-27 16:31:38 +00:00
bae3426af7 reimport pr137735 due to merging check issues (#138959)
This is a cherry-pick of #137735 by @mikaylagawarecki, which cannot be merged due to a (wrongly) failing check for codev

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959
Approved by: https://github.com/mikaylagawarecki
2024-10-27 16:31:34 +00:00
144d75d934 Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527)"
This reverts commit 07e30eae2a8241e531890b6c9a33ab5a80c5ccaf.

Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2440070035))
2024-10-27 15:39:33 +00:00
d969b34377 Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)"
This reverts commit f1a677cba5ef7514f2cf303753d3117528867a33.

Reverted https://github.com/pytorch/pytorch/pull/138804 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to fail pr_time_benchmarks job in trunk ([comment](https://github.com/pytorch/pytorch/pull/138804#issuecomment-2440069407))
2024-10-27 15:36:46 +00:00
5d074746e9 [BE]: Add better optional typing (#138426)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
2024-10-27 14:19:00 +00:00
d9534a50a9 [AOTI][refactor] Separate header codegen (#138882)
Summary: Move arrayref specific header codegen logic to cpp_wrapper_cpu_array_ref.py, and consolidate some header files codegen logic

Test Plan: CI

Differential Revision: D64899248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138882
Approved by: https://github.com/hl475
2024-10-27 14:14:27 +00:00
40c098f731 Introduce a device-agnostic runtime API design (#132204)
# Motivation
According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design.
I personally prefer the **Simple Version** APIs, which no longer accept the device type as an input argument. It means we will leverage `getAccelerator` to fetch the current accelerator, and it is flexible enough to expand these APIs to handle scenarios with multiple types of accelerators. The design does **NOT** break the previous design philosophies.
I also believe that the namespace torch.accelerator is better: it lets users know that the APIs they are calling run on an accelerator rather than the CPU. This is important. Meanwhile, we can follow a simple set of API design principles:
1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter.
2. Device-specific APIs should be placed under device-specific submodules.
3. APIs required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter.

Also, I list the pros and cons of **Simple Version** here:
Pros:
- `torch.accelerator.foo` will have the same input argument as `torch.xxx.foo`, bringing a better user experience;
- more concise, facilitate the developer to write a device-agnostic code.

Cons:
- no obvious drawbacks.

# Additional Context
I list the new APIs here:
```python
torch.accelerator.is_available() -> bool:
torch.accelerator.current_accelerator() -> torch.device:
torch.accelerator.device_count() -> int:
torch.accelerator.current_device_idx() -> int:
torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None:
torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream:
torch.accelerator.set_stream(stream: torch.Stream) -> None:
torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None:
```
Following the discussion with Alban, we decided to rename `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit. We will submit other PRs to add device and stream context managers.
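
A hypothetical usage sketch of the device-agnostic APIs listed above (assuming a build that includes this PR and an available accelerator); it is illustrative only, not part of the PR:

```python
import torch

if torch.accelerator.is_available():
    dev = torch.accelerator.current_accelerator()  # e.g. device(type='cuda')
    torch.accelerator.set_device_idx(0)
    x = torch.ones(1024, device=dev)
    y = x * 2
    torch.accelerator.synchronize()                # device-agnostic sync
    print(torch.accelerator.device_count(), torch.accelerator.current_device_idx())
```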

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204
Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD
2024-10-27 10:37:09 +00:00
1152726feb [PGNCCL] Use recursive mutex in NCCLComm (#138997)
Fixes #138995: [PGNCCL][BUG] mutex acquired in recursive way may deadlock

The fix: use `std::recursive_mutex` to replace `std::mutex`.

Found and proposed by @dsjohns2. Thanks!
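
The deadlock pattern being fixed can be illustrated in Python (the actual fix is in C++ with `std::recursive_mutex`); this sketch is only an analogy:

```python
import threading

lock = threading.Lock()    # non-reentrant, like std::mutex
rlock = threading.RLock()  # reentrant, like std::recursive_mutex

def inner(l):
    with l:                # second acquisition by the same thread
        pass

def outer(l):
    with l:                # first acquisition
        inner(l)           # deadlocks with Lock, succeeds with RLock

outer(rlock)               # fine
# outer(lock)              # would hang: the same thread re-acquires a held Lock
```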

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138997
Approved by: https://github.com/dsjohns2
2024-10-27 08:58:47 +00:00
4681539f42 [inductor] force strides for efficient attn bwd (#138879)
Try to fix https://github.com/pytorch/pytorch/issues/138772 .

aten._scaled_dot_product_efficient_attention_backward requires the out and gradient_out to have stride order (3, 1, 2, 0).  When Inductor layout optimization is enabled, Inductor may change tensor strides if they are not user visible. For efficient_attention_backward, Inductor tries to follow eager strides. But the eager strides Inductor gets for backward graph may be the one after optimization. There are a few possible fixes:
1. change the kernel to allow stride order other than  (3, 1, 2, 0). This is probably hard
2. backout https://github.com/pytorch/pytorch/pull/112045/files and don't do layout optimization if the model contains efficient_attention.
3. Force (3, 1, 2, 0) strides order for the relevant tensors
4. Pass original eager layouts to Inductor for the backward graph. Let Inductor follow those layouts for tensors with extra layout requirement.

The PR implements option 3. Option 4 looks more general to me, I think we can do this in long term.

I tried to add a test but failed to repro: https://gist.github.com/shunting314/fe37a246aad269de9ea00199446688f6

Here is the original command to repro the issue:
```
TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 time python benchmark.py --model maxvit_nano_rw_256 --precision bfloat16 --torchcompile --bench train --no-retry -b 64
```
benchmark.py is https://github.com/huggingface/pytorch-image-models/blob/main/benchmark.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138879
Approved by: https://github.com/drisspg, https://github.com/eellison
2024-10-27 04:54:15 +00:00
c480a479b1 Make automatic_dynamic state live per CodeId, rather than on code object (#138740)
This is semantics-changing if you are dealing with multiple code objects that have exactly the same filename/firstlineno/name but are distinct objects and need non-aliasing automatic dynamic state. Otherwise, this should be equivalent (modulo lifetime). I want to do this because for PGO I can't index on code object identity; I need a stable identifier.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138740
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138693, #138717
2024-10-27 03:08:41 +00:00
14a45d7793 Refactor core algorithm for automatic dynamic shapes (#138717)
While working on automatic dynamic PGO (https://github.com/pytorch/pytorch/pull/138052) one abstract property I was looking for out of profile information is that it formed a semilattice: I could join together two profiles and get a merged profile that is consistent with the profiles that I saw in both cases. While working on this data structure that supported joins, I realized that the base automatic dynamic algorithm could be implemented in this way, therefore this refactor.

The basic recipe is that we now support a join operation on FrameStateSizeEntry. Intuitively, if you join two sizes that are equal, you get back that size (join(2, 2) == 2), but if you join two different sizes you get a special singleton auto_dynamic indicating that the size of the tensor is dynamic (join(2, 3) == auto_dynamic). So now, the automatic dynamic algorithm is: (1) compute the FrameStateSizeEntry that corresponds to the concrete values we've seen, and (2) join it into the ambient FrameStateSizeEntry. As a bonus, compiler collectives can buy into the same abstraction (we're simply distributing FrameStateSizeEntry from each node to every other node). For convenience, I also added the necessary `auto_unset` extra state which is the identity element (which makes our semilattice bounded from both top and bottom). Here, join(2, auto_unset) == 2.
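
As a toy illustration of the join semantics described above (the names below are invented for this sketch and do not match the actual FrameStateSizeEntry code):

```python
AUTO_UNSET = "auto_unset"      # identity element: nothing observed yet
AUTO_DYNAMIC = "auto_dynamic"  # absorbing element: size treated as dynamic

def join(a, b):
    if a == AUTO_UNSET:
        return b
    if b == AUTO_UNSET:
        return a
    return a if a == b else AUTO_DYNAMIC

assert join(2, 2) == 2                        # consistent size stays static
assert join(2, AUTO_UNSET) == 2               # identity element
assert join(2, 3) == AUTO_DYNAMIC             # divergent sizes become dynamic
assert join(AUTO_DYNAMIC, 2) == AUTO_DYNAMIC  # dynamic absorbs everything
```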

While doing this, there was a complication: the infer-stride algorithm wasn't technically a semilattice. Here, I did what I suggested in the original code review https://github.com/pytorch/pytorch/pull/130232, which is to stop using a heuristic and instead replicate the stride inference algorithm in automatic dynamic. This means that when I join strides together, I don't join their concrete values; instead, if a stride can be inferred as the contiguous stride for a particular inner dimension, it is represented as InferStride(dim). There's an example in the code which I recommend looking at.

Some other extra things that are happening in this PR:

* I tried to deduplicate the size/stride automatic dynamic logic as much as possible. So hopefully less code to review here.
* I had to reimplement all the logging. For the most part I tried to track the logging as closely to the original as possible, but I think we could be emitting less Chrome events here
* The `marked_dynamic` handling is still preserved as is, but I kind of don't like it and we should figure out how to put it somewhere else

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138717
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138693
2024-10-27 03:08:41 +00:00
28013aa527 [AOTInductor] Disable comprehensive_padding when use_runtime_constant_folding=True (#138872)
Summary:
Disable comprehensive_padding when use_runtime_constant_folding=True.
We need to disable the comprehensive padding because it modifies the strides, so the stride information between the constant graph and the main graph would differ.

Test Plan:
```
buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/643940255/17/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR_EP" --aot-inductor-config="{'max_autotune': True, 'aot_inductor.use_runtime_constant_folding': True}"
```

Reviewed By: 22quinn, henryoier

Differential Revision: D64927546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138872
Approved by: https://github.com/chenyang78
2024-10-27 01:12:27 +00:00
fee17d530d [AOTInductor] Add relu_nan_to_num option for pre-grad passes (#138545)
Summary: Add a relu_nan_to_num in pre-grad pass.

Test Plan: Included in commit

Differential Revision: D64724780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138545
Approved by: https://github.com/chenyang78
2024-10-27 00:57:11 +00:00
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
cyy
fb36daac9f [7/N] Fix extra warnings brought by clang-tidy-17 (#138972)
Fix extra warnings brought by clang-tidy-17

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138972
Approved by: https://github.com/Skylion007
2024-10-26 19:09:47 +00:00
3a6f014381 [Inductor] improve the stride preservation logic of user-visible outputs (#136732)
## Context

Previously, the stride preservation of user-visible nodes worked as follows:

- After joint-graph tracing, we recorded the **names** of user-visible nodes and passed them to GraphLowering.
- In GraphLowering, we determined whether we needed to preserve the striding for a certain node by checking if the node's name was in `user_visible_outputs`.
- We obtained the original strides by checking `node.meta["val"].stride()`.

However, there's a problem with this approach: the nodes in output_node.args[0] and their strides could change between the completion of joint-graph tracing and the consumption of `user_visible_outputs` (e.g., during post-grad passes), making it unreliable.

## This PR

- After joint graph tracing:
  - Record the original strides for all nodes in `output_nodes.args[0]` as `output_node.meta["original_output_strides"]` (recording for all nodes in case we need the info for other purposes such as debugging).
  - Record the indices of user-visible outputs as `output_node.meta["user_visible_output_idxs"]`.
- Remove the original plumbing of `user_visible_outputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136732
Approved by: https://github.com/Chillee
2024-10-26 18:49:14 +00:00
1d83a893c5 [BE][MPS] Use templates in Repeat shader (#138962)
- Instead of generating shader from templated code on host, just define two specializations of one kernel template
- Get rid of unused `threads_per_threadgroup` argument
- Replace `if (typeid(scalar_t) == typeid(int32_t))` with `if constexpr (std::is_same_v<scalar_t, int32_t>)` in the host code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138962
Approved by: https://github.com/janeyx99
2024-10-26 17:42:07 +00:00
e78c4ded48 Use the unicode variant of the Windows API (#47422) (#138605)
Use the unicode variant of the Windows API in c10/util/Backtrace.cpp
- #47422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138605
Approved by: https://github.com/peterjc123, https://github.com/malfet
2024-10-26 17:41:39 +00:00
cyy
1a73255102 Concat namespaces in jit code (#138976)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138976
Approved by: https://github.com/Skylion007
2024-10-26 17:41:27 +00:00
4de93d1ead [BE][Ez]: Fix bad TypeIs conversion (#138990)
Fixes on TypeIs / TypeGuard conversion error. Follow up to #133814
Thanks for @ezyang for reminding me to double check the side conditions here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138990
Approved by: https://github.com/malfet
2024-10-26 17:37:40 +00:00
705f5b3489 Several enhancements for check_results.py (#137925)
1) Always generate expected_results.csv rounded to the accuracy of the first three significant digits, e.g. 112313212312 --> 112000000000, etc.
2) Regenerate all records in expected_results.csv, not just the failed ones. Why? Because if something changes by 1.3% and the noise threshold is 1.5%, we want to reflect that.
3) Add "please update all results that changed significantly, and not only the failed ones" to the output.

```
(myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out
WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

new expected results file content if needed:
a,instruction count,9011000000,0.01
b,memory,20010000000,0.1
c,something,107100000000,0.1

There was some failures you can use the new reference expected result stored at path:out and printed above

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925
Approved by: https://github.com/aorenste
2024-10-26 16:27:55 +00:00
1a2dc89f17 [Dynamo] Allow torch.cond() to handle empty arguments (#138190)
Fixes #138150

```python
import torch

@torch.compile(fullgraph=True)
def foo(x, y, z):
    def f():
        return y + 2

    def g():
        return z + 1

    return torch.cond(x, f, g)

print(foo(torch.zeros(1), torch.ones(1), torch.ones(1))) # tensor([2.])
print(foo(torch.ones(1), torch.ones(1), torch.ones(1))) # tensor([3.])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138190
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-10-26 15:26:21 +00:00
c84f9b2069 [dynamo][guards] Log average time of constructed guard_manager (#138941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138941
Approved by: https://github.com/jansel
ghstack dependencies: #138512, #138896
2024-10-26 15:14:46 +00:00
dba6887dc6 [dynamo][refactor][config-cleanp] Use guard_manager consistently instead of check_fn (#138896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138896
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #138512
2024-10-26 15:14:46 +00:00
49ed365b22 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/
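
A small example of the difference (illustrative, not from the PR): with `TypeIs`, the type checker narrows the negative branch as well.

```python
from typing import Union
from typing_extensions import TypeIs

def is_str(x: Union[str, int]) -> TypeIs[str]:
    return isinstance(x, str)

def handle(x: Union[str, int]) -> None:
    if is_str(x):
        print(x.upper())  # x narrowed to str
    else:
        print(x + 1)      # with TypeIs, x is narrowed to int here;
                          # with TypeGuard it would still be Union[str, int]
```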

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-10-26 15:07:13 +00:00
eb6c7b93a7 Log AOTAutogradCache state to PT2 Compile Events (#138604)
Same as previous diff for inductor, but for autograd instead

Differential Revision: [D64765199](https://our.internmc.facebook.com/intern/diff/D64765199/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138604
Approved by: https://github.com/oulgen
2024-10-26 15:04:38 +00:00
f1a677cba5 In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)
Title + we avoid calling defer_assert when we statically know the guard results.
timing for pnasnet5large

```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches without the diff
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
2024-10-26 15:03:53 +00:00
14a17ad630 Elide calls to is_nested in Dynamo-traced graphs (#138841)
Before this PR, calling `is_nested` in-graph would result in graph code like the following:
```python
  class GraphModule(torch.nn.Module):
      def forward(self, L_nt_: "f64[3, s1, 5]", s1: "Sym(s1)"):
          l_nt_ = L_nt_

          # Note this useless line!
          getattr_1 = l_nt_.is_nested;  getattr_1 = None

          add: "f64[3, s1, 5]" = l_nt_ + 2;  l_nt_ = None
          return (add,)
```

This PR follows what is done for `is_sparse` / `is_quantized`: store it onto `TensorVariable` and have `getattr` calls to `is_nested` return the stored value as a constant. This removes the useless line above from the graph. Note that guarding is handled through tensor type check guards, so no need to guard on `is_nested` status.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138841
Approved by: https://github.com/soulitzer
2024-10-26 15:03:32 +00:00
3234b251b3 Fix typos in CreateTMADescriptorVariable (#138877)
This fixes some leftover typos in
CreateTMADescriptorVariable.call_function (and close).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138877
Approved by: https://github.com/davidberard98, https://github.com/zou3519, https://github.com/Skylion007
ghstack dependencies: #138759
2024-10-26 15:03:07 +00:00
043864afdf enable test_x86inductor_quantizer.py UTs on Windows. (#138937)
These UTs failed months ago, but as the main branch moved forward, some PRs fixed them. Let's turn them on.

Local test passed:
<img width="863" alt="image" src="https://github.com/user-attachments/assets/a2ec160c-cdf1-404d-bc24-2f60faa8d791">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138937
Approved by: https://github.com/jansel
2024-10-26 12:48:51 +00:00
a3aca24ae5 [AOTI] add C shim for QLinearPointwise (#138439)
This PR adds C shim for `QLinearPointwisePT2E` and `QLinearPointwiseBinaryPT2E`.

The below changes are needed:
- We moved the qlinear API out of the anonymous namespace since we need to call it in the shim layer.

- We fixed the code which generated the `inputs` and `constant_args` so that we can directly leverage the `codegen` of the parent class.

- `x_scale` and `x_zp` are ensured to be tensor during the lowering stage, thus we can remove the code which handles whether they're tensor or not.
  fb0da32377/torch/_inductor/mkldnn_lowerings.py (L492-L496)

  fb0da32377/torch/_inductor/mkldnn_lowerings.py (L499-L503)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138439
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-26 08:04:15 +00:00
99608ceed6 Scoped extension building for C++ backed custom ops tests (#136695)
FIXES #125579 #131103 #133197 #133283 #134738 #135369 #135685

Tests that create C++ extensions can cause flakiness in CI due to library namespace conflict and test ordering. We can build them in temp dirs to ensure isolation.

An alternative is to build these as part of the build process and have build time errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136695
Approved by: https://github.com/zou3519
2024-10-26 07:41:00 +00:00
10e2840ce3 Enable failing diffs on update_hint_regression and sum_floordiv_regression and autograd benchmarks regression (#137548)
update_hint_regression has been behaving, so I am setting 2% noise threshold for it. 1.5% for sum_floordiv_regression.

I have one concern with the way we do the regression detection: small changes below the threshold level will accumulate and eventually trigger a failure. To avoid that, we would have to keep an eye on the dashboard and potentially refresh the expected-results file regularly even when there are no failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548
Approved by: https://github.com/aorenste
2024-10-26 07:28:49 +00:00
07e30eae2a [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective and we will block there waiting for the comm to be ready -- the same effect as blocking init, with no "opening" compared to eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
2024-10-26 06:53:15 +00:00
00504aa6b8 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.

In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-26 03:34:59 +00:00
940658405b [test/test_cuda] Use temp file for test_improper_device_name (#138856)
Use `tempfile.NamedTemporaryFile()` to have test_specify_improper_device_name save/load to a tmp file rather than the current-working-directory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138856
Approved by: https://github.com/Skylion007
2024-10-26 02:42:25 +00:00
0ac9a663ec [hop] always trace subgraph with fake to support .item in eager mode (#138771)
Fixes https://github.com/pytorch/pytorch/issues/138664

When we eagerly run torch.cond with autograd keys set, we'll create_fw_bw_graph using real tensors. This PR forces fakification when cannot detect the fake mode so as to trace the .item calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138771
Approved by: https://github.com/zou3519, https://github.com/malfet
2024-10-26 02:17:17 +00:00
f14247d5aa [dynamo] Accurately identify mutated cells captured by multiple functions (#138632)
This patch changes `mutated_closure_cell_contents: Set[str]` to
`mutated_closure_cell_ids: Set[int]` so that Dynamo can more accurately
identify closure cells across different instances of
`UserFunctionVariable`. This prevents Dynamo from mistakenly treating a
cell as immutable even though it will be mutated when referenced as a
closure cell from another function.

More context in
https://github.com/pytorch/pytorch/issues/138112#issuecomment-2420580779.

Fixes #138112.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138632
Approved by: https://github.com/jansel
ghstack dependencies: #138639
2024-10-26 02:17:07 +00:00
1e1f0ceb40 Allow Lazy Module to be modelled as UnspecializedNNModuleVariable (#138639)
This patch
- removes the `is_lazy_module` check from `is_dynamic_nn_module`, and
  adds a regression test.
- removes a series of dynamo expected failures on lazy modules. The few
  ones I checked all were failing due to speculation log divergence,
  similar to #138489.

Note that #100047 introduced the conditional removed in this patch, and
it was trying to fix #100001. But I've confirmed locally that #100001 no
longer repros after this patch.

Fixes #138489. See more context in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138639
Approved by: https://github.com/jansel
2024-10-26 02:17:07 +00:00
4af93fdb77 [BE]: Update cudnn_frontend submodule to 1.8.0 (#138709)
Update cudnn frontend. Let's see what breaks

@eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138709
Approved by: https://github.com/eqy
2024-10-26 01:55:33 +00:00
565a53d326 Use DLPack for creating tensors out of custom classes, when available. (#138697)
Fixes #120614
Takes over #120615

In summary, this PR:
- Adds a `__dlpack__` attribute check in the tensor creation path (i.e. [`internal_new_from_data` @ tensor_new.cpp](cdfe1bffd1/torch/csrc/utils/tensor_new.cpp (L266)))
    - Creates the tensor by using the DLPack machinery, instead of an element-by-element copy
    - No changes since #120615
- Adds a test, making sure the DLPack machinery is used
    - Wraps a tensor in a fresh `TensorDLPackWrapper` class that implements only the DLPack methods
    - Creates a new tensor from an instance of `TensorDLPackWrapper`
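
A rough sketch of the wrapper idea described above (hypothetical names and code, not the actual test):

```python
import torch

class TensorDLPackWrapper:
    """Exposes only the DLPack protocol of a wrapped tensor."""
    def __init__(self, t):
        self._t = t
    def __dlpack__(self, stream=None):
        return self._t.__dlpack__(stream=stream)
    def __dlpack_device__(self):
        return self._t.__dlpack_device__()

src = torch.arange(4.0)
# With this PR, torch.tensor() notices the __dlpack__ attribute and goes
# through the DLPack machinery instead of an element-by-element copy.
out = torch.tensor(TensorDLPackWrapper(src))
assert torch.equal(out, src)
```
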
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138697
Approved by: https://github.com/ezyang

Co-authored-by: Wenzel Jakob <wenzel.jakob@epfl.ch>
2024-10-26 01:27:05 +00:00
e299193423 Bug fix: Use oneDNN for torch._int_mm CPU only when avx512_vnni is supported (#136942)
Fixes #136746

If AVX512_VNNI is not supported, overflow occurs inside oneDNN. Fall back to ref path in such case.
UT is also updated to catch the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136942
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-26 01:17:11 +00:00
a3de067975 [PyTorch] Use 128-bit vectors for ARM64 (#137426)
The correct vector length for ARM64 is 128 bits (16
bytes). We were previously using double this, apparently just because
that would be the same length as AVX2.

Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716, #138744
2024-10-26 00:20:35 +00:00
7ada814107 [c10/util] Add explicit include of <mutex> to c10/util/env.cpp (#138854)
Add explicit include of `<mutex>` to `c10/util/env.cpp` since it has usages of `std::lock_guard` which is defined in the header `<mutex>`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138854
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-26 00:16:05 +00:00
cyy
1605d4aeb8 Fix object slice (#138880)
To avoid casting Tensor to Tensorbase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138880
Approved by: https://github.com/Skylion007
2024-10-26 00:13:19 +00:00
939fc4e335 [PGNCCL] Fix P2P data corruption in non-blocking mode (#138860)
In non-blocking mode, it seems a single `ncclRecv` or `ncclSend` call can "early return" `ncclSuccess` before the kernel is fully enqueued. This causes the event record below to miss the P2P kernel, leading to data corruption.

Side note: per NCCL, it is legal to call `ncclSend` or `ncclRecv` only if there is only one P2P op. This is true whether we are in blocking or non-blocking mode.

In this fix, we use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. The ncclGroup call itself should introduce minimal overhead.

Added a test `test_non_blocking_p2p`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138860
Approved by: https://github.com/shuqiangzhang
2024-10-25 23:58:43 +00:00
54d13a9348 [c10d][CI] Improve world size setting in some tests (#138846)
Following change in #137161 , bumping world size for some test suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138846
Approved by: https://github.com/fduwjj
2024-10-25 23:02:17 +00:00
a57e418c1f [PGNCCL] Use ncclSend and ncclRecv (#138875)
Stop routing to `torch::cuda::nccl`. Use native `ncclSend` and `ncclRecv` APIs instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138875
Approved by: https://github.com/shuqiangzhang
2024-10-25 22:17:10 +00:00
4d92d6e604 [Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643)
Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last.

This way, the channels-last CK instances will be added to benchmark choices in max autotune

# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
2024-10-25 22:11:44 +00:00
36b7135c6f Revert "[fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)"
This reverts commit 6cadf616aeb612f3c866b734268919ad1616ffaf.

Reverted https://github.com/pytorch/pytorch/pull/138681 on behalf of https://github.com/jeanschmidt due to Introduced regressions on linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138681#issuecomment-2438945493))
2024-10-25 22:07:30 +00:00
14b8028c81 [Pytorch][ATEN] Enable FP8 NCCL in Pytorch ATEN (#138776)
Summary: Enable FP8 NCCL in Pytorch ATEN to unblock FP8 collective communication such as FP8 all-to-all

Test Plan: CI & D64374424

Differential Revision: D64866426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138776
Approved by: https://github.com/eqy, https://github.com/jianyuh
2024-10-25 21:56:47 +00:00
86b45bde19 [pt2] Add logger logging for remote fx graph cache get + put (#138164)
Summary: Capture the timing for the remote fx graph cache get and put operations and add them to the logger logging.

Test Plan:
1) Landed D64483593 and waited for logger actualization.
2) Ran test script on devserver: `buck2 run mode/opt scripts/slarsen/torch_compile_model:run`
3) Queried dynamo_compile/sandbox:
```
(pytorch-3.10_4) devvm2296:~/local/pytorch-3.10_4  $ scuba -e="select time,co_filename,remote_fx_graph_cache_get_time_s,remote_fx_graph_cache_put_time_s from \`dynamo_compile/sandbox\` where remote_fx_graph_cache_put_time_s is not null"
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
|    time    |                                                                                    co_filename                                                                                    | remote_fx_graph_cache_get_time_s | remote_fx_graph_cache_put_time_s |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
| 1729136266 | null                                                                                                                                                                              |              0.05652284622192383 |               0.9691152572631836 |
| 1729136263 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/289bb46b326874c6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |               0.8298435211181641 |              0.18642282485961914 |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
```

Reviewed By: oulgen

Differential Revision: D64484025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138164
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-10-25 21:30:18 +00:00
78377ec130 [PT2][Optimus] Normalize Clamp to use kwargs (#138723)
Summary: The current clamp normalization does not include torch.clamp, so its min and max are not normalized to kwargs; thus the batch fusion of clamp can hit the "min and max are both empty" problem.

Test Plan:
```
buck2 run mode/opt servicelab/ai_ml/auto_tune:local_model_pt2 -- --flow_id 654509735 --test_mode split
```

GPU type: NVIDIA PG509-210
=============Print full analysis for offsite_cvr_oba_optout_dedicated_model================
| Metric             | Value            |
|:-------------------|:-----------------|
| GPU type           | A100             |
| Batch size         | 10               |
| Latency            | 227.13 ms        |
| Model size         | 2322763344 bytes |
| Flops/example      | 1136.52 G        |
| TFLOPS             | 50.04            |
| MFU                | 16.04%           |
| Activation/example | 2722.49 MB       |
I1023 112249.043 local_model_with_pt2.py:25] benchmark results [('batch_size', 10), ('latency_ms', 22712), ('model_size_bytes', 2322763344), ('flops_per_example', 113652), ('tflops_g', 5003), ('mfu', 1603), ('activation_per_example_mb', 272249)

Differential Revision: D64848369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138723
Approved by: https://github.com/jackiexu1992
2024-10-25 21:05:39 +00:00
a874ec85e8 [Functorch] Fix devices Parameter Type in benchmark_utilization Function (#138774)
Summary:
Issue described in https://github.com/pytorch/pytorch/issues/136697

Original user does not have CLA privileges so this is my commandeer

Test Plan: OSS CI

Differential Revision: D64872833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138774
Approved by: https://github.com/davidberard98
2024-10-25 19:25:18 +00:00
3a0c361899 Remove preserve ops (#138371)
Summary:
CI
#buildall

Test Plan: CI

Reviewed By: StellarrZ

Differential Revision: D64151426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138371
Approved by: https://github.com/bdhirsh
2024-10-25 19:13:55 +00:00
b988388bac Add CUDA 12.6 to Linux CD docker images (#138563)
Reference https://github.com/pytorch/builder/pull/1003/files
Related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138563
Approved by: https://github.com/malfet
2024-10-25 19:10:07 +00:00
846b4e614b [TF32][cuDNN][Convolution] Add some missing TF32 decorators (#138768)
Newer cuDNN versions seem to be able to dispatch to cuDNN kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138768
Approved by: https://github.com/Skylion007
2024-10-25 19:03:42 +00:00
c6bb9b53f4 [scan] better error handling and remove redundant tests (#137967)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137967
Approved by: https://github.com/zou3519
2024-10-25 19:01:25 +00:00
7d283309d8 Avoid calling realize() on LazyVariableTracker on reconstruct (#138495)
Fixes: https://github.com/pytorch/pytorch/issues/137686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138495
Approved by: https://github.com/zou3519
2024-10-25 19:01:15 +00:00
392221b390 Made DDPOptimizer work with HOPs (#138787)
Fixes https://github.com/pytorch/pytorch/issues/137481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138787
Approved by: https://github.com/yf225
ghstack dependencies: #138733, #138794, #138881
2024-10-25 18:59:01 +00:00
07dbc42881 Stop force realizing to prevent recursion errors unless it's much bigger (#138881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138881
Approved by: https://github.com/shunting314
ghstack dependencies: #138733, #138794
2024-10-25 18:59:01 +00:00
de54246c42 Recommend pip install -r requirements in the unit testing guidelines. (#137797)
Somehow `make setup-env`, as recommended in CONTRIBUTING.MD, is not installing all dependencies required to run tests.

This makes it slightly clearer when running tests.

Specific repro on my side was
```
git checkout e7679663070e3149ae7cd6e28d376d86852ce9e4
make setup-env
conda activate pytorch-deps
python test/test_utils_internal.py
```

which is what my reading of the instructions implies should be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137797
Approved by: https://github.com/albanD
2024-10-25 18:47:44 +00:00
03f9136870 Add wait counter on cuda::device_synchronize (#138883)
The wait counter is typically only minute precision, but if there is a collective in the queue it will show up. We think this explains up to eight minutes of delay in some compile traces we're looking at, but the counter would definitively prove it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64944970](https://our.internmc.facebook.com/intern/diff/D64944970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138883
Approved by: https://github.com/eqy
2024-10-25 18:13:57 +00:00
dbbdfd9df5 Add pytorch.wait_counter.dynamo_compile (#138072)
I was discussing with James March how the current fx_codegen_and_compile
counter doesn't actually capture all compile time.  This one is more
accurate and corresponds closely to the existing events in dynamo_compile
table.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138072
Approved by: https://github.com/markkm
2024-10-25 18:12:34 +00:00
77587f43d2 Add one more shard for CPU pull jobs (#138894)
The first shard is close to 3.5 hours and timing out flakily in trunk now, for example https://github.com/pytorch/pytorch/actions/runs/11509141659/job/32039126506.  So, I think we could just add one more shard in the same spirit as https://github.com/pytorch/pytorch/pull/137433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138894
Approved by: https://github.com/Skylion007
2024-10-25 18:09:50 +00:00
ba6526814a Add dtype attribute to CSEVariable (#136778)
Summary:
- This diff introduces a `dtype` attribute on `TritonCSEVariable` and a dtype propagation helper function to infer the output dtype from the input dtypes for each op.

- There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.
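
For intuition, dtype propagation for an elementwise binary op can be illustrated with PyTorch's type-promotion rules (a sketch only; the Inductor helper covers many more op categories):

```python
import torch

def infer_binary_op_dtype(a: torch.dtype, b: torch.dtype) -> torch.dtype:
    # Output dtype of an elementwise binary op follows type promotion.
    return torch.promote_types(a, b)

assert infer_binary_op_dtype(torch.float16, torch.float32) == torch.float32
assert infer_binary_op_dtype(torch.int32, torch.int64) == torch.int64
```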

Test Plan: CI

Differential Revision: D61815079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2024-10-25 18:00:30 +00:00
d0640b945b [inductor][nit] removing unnecessary else statements (#138789)
Summary: while reading through inductor template code I found a few places where else statements were driving me crazy. Fixing them as I read

Test Plan: CI

Differential Revision: D64882385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138789
Approved by: https://github.com/aakhundov
2024-10-25 17:59:25 +00:00
69af467d4f Eliminate c10::value_or_else (#138818)
Test Plan: Sandcastle

Differential Revision: D64857418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138818
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-10-25 17:59:01 +00:00
a6287b5c27 Fixing issue in move pass for copying Parameter (#138855)
Summary: Fixing bug for Parameter copy during move pass of exported graph.

Test Plan:
UT

runs on APS models.

Differential Revision: D64876951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138855
Approved by: https://github.com/pianpwk

Co-authored-by: Gagan Jain <gaganj@meta.com>
2024-10-25 17:57:27 +00:00
375d71cc5a plumb is_export flag to FunctionalTensorMode in analysis pass (#138836)
Summary: there is an issue with functionalization V2 in export. This is a quick fix that plumbs `is_export` through to `run_functionalized_fw_and_collect_metadata`.

Test Plan: CI

Differential Revision: D64915263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138836
Approved by: https://github.com/tugsbayasgalan
2024-10-25 17:56:14 +00:00
3d0aa6f049 Update readme with std::optional (#138914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138914
Approved by: https://github.com/malfet
2024-10-25 17:40:58 +00:00
6f66398ab8 Revert "[aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731)"
This reverts commit 245026af2d2f26c74993cb90e01bddbd627c6797.

Reverted https://github.com/pytorch/pytorch/pull/138731 on behalf of https://github.com/jeanschmidt due to introduced regressions on linux-focal-cuda12.1-py3.10-gcc9-bazel-test ([comment](https://github.com/pytorch/pytorch/pull/138731#issuecomment-2438417669))
2024-10-25 17:37:32 +00:00
447bb72822 Revert "[c10d][CI] Improve world size setting in some tests (#138846)"
This reverts commit 9c35e33d9b02e384f0d504f942a916e9e849b163.

Reverted https://github.com/pytorch/pytorch/pull/138846 on behalf of https://github.com/jeanschmidt due to introduced breaks in linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138846#issuecomment-2438415315))
2024-10-25 17:35:27 +00:00
2980aed65b [inductor][memory] restructuring memory.py and turn on the flag (#137205)
Addressing additional comments given in PR https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137205
Approved by: https://github.com/eellison
2024-10-25 17:19:34 +00:00
817b4988e4 [dynamo][config-cleanup] Remove enable_cpp_guard_manager=False codepath (#138512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138512
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-25 16:41:55 +00:00
fe18a221eb Add debug backend that applies CrossRefFakeMode, use in compiler bisector (#138651)
I was debugging an internal NE divergence for a while that ended up being caused by a bad meta. I added a config option and an explicit backend `aot_eager_decomp_partition_crossref` to enable CrossRefFakeMode when running the graph. I added an explicit backend because I suspect it will be useful for internal models, but I'm also happy to leave it as a config option.

It will only test ops that have a meta implementation, to avoid the memory overhead of hitting the fallback path and running in eager.
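
A hedged usage sketch (the backend name comes from this PR; whether any extra config is needed to enable it is an assumption here):

```python
import torch

# Compile with the cross-ref debugging backend named in this PR to compare
# meta/fake results against real ones while running the graph.
@torch.compile(backend="aot_eager_decomp_partition_crossref")
def f(x):
    return torch.nn.functional.gelu(x) * 2

f(torch.randn(8))
```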

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138651
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-10-25 15:58:36 +00:00
6cadf616ae [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-25 15:52:58 +00:00
78a0158540 [Dynamo] Improve args in higher_order_ops [1/N] (#138799)
Replaced hard-coded argument indices with meaningful variable names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138799
Approved by: https://github.com/zou3519
2024-10-25 13:55:41 +00:00
45b8155a07 [CI] Run periodic jobs only on pytorch/pytorch repo (#138874)
GitHub by default tries not to run periodic jobs on forks; see https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/disabling-and-enabling-a-workflow
But there is a special test repo called `pytorch/canary` that would keep running those workflows for the next 60 days, which is a waste of resources.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138874
Approved by: https://github.com/huydhn
2024-10-25 13:42:37 +00:00
245026af2d [aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138731
Approved by: https://github.com/bdhirsh
2024-10-25 12:35:52 +00:00
2c82f73647 [Pipelining] Clean up hooks in zero bubble (#138720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138720
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504, #138735
2024-10-25 12:06:54 +00:00
12755f45ff [Pipelining] small comments and variable renames (#138735)
Addressing the comments in previous PRs to update the variable names and add additional code comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138735
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504
2024-10-25 12:06:54 +00:00
9c35e33d9b [c10d][CI] Improve world size setting in some tests (#138846)
Following the change in #137161, bumping the world size for some test suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138846
Approved by: https://github.com/fduwjj
2024-10-25 10:40:21 +00:00
a1175e3437 [BE] Strides are always non-negative, remove pointless test (#138784)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138784
Approved by: https://github.com/Chillee
2024-10-25 10:39:32 +00:00
22d2e2d9a0 Set RUNPATH so installed tests can find the required shared libraries (#136627)
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.

For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
2024-10-25 09:38:08 +00:00
86d4b7d60b [FX][export][dynamo] use tuple instead of list in normalized args_spec (#138212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138212
Approved by: https://github.com/jansel
2024-10-25 06:43:55 +00:00
ce631939f0 [Distributed] [18/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138692)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138692
Approved by: https://github.com/ezyang
2024-10-25 05:32:38 +00:00
b999daf7a9 Add sets to list of safe objects to de-serialize (#138866)
Lists, dicts and tuples are already allowed; it's a bit weird to exclude sets from the list of basic containers.

Test plan (in addition to unittest):
```python
torch.save({1, 2, 3}, "foo.pt")
torch.load("foo.pt", weights_only=True)
```

Fixes https://github.com/pytorch/pytorch/issues/138851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138866
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-25 05:23:08 +00:00
907f001a68 Bump onnx from 1.16.1 to 1.17.0 in /.ci/docker (#138719)
Bumps [onnx](https://github.com/onnx/onnx) from 1.16.1 to 1.17.0.
Release notes (sourced from [onnx's releases](https://github.com/onnx/onnx/releases), truncated):

v1.17.0

ONNX v1.17.0 is now available with exciting new features! We would like to thank everyone who contributed to this release! Please visit [onnx.ai](https://onnx.ai/) to learn more about ONNX and associated projects.

Key Updates

ai.onnx Opset 22
- Update to support bfloat16: Acos, Acosh, Asin, Asinh, Atan, Atanh, AveragePool, Bernoulli, Conv, ConvTranspose, Cos, Cosh, DeformConv, Det, Dropout, Elu, EyeLike, GRU, GlobalAveragePool, GlobalLpPool, GlobalMaxPool, GridSample, HardSigmoid, HardSwish, InstanceNormalization, LSTM, LpNormalization, LpPool, MaxPool, MaxRoiPool, MaxUnpool, Mish, Multinomial, NegativeLogLikelihoodLoss, RNN, RandomNormal, RandomNormalLike, RandomUniform, RandomUniformLike, RoiAlign, Round, Selu, Sin, Sinh, Softplus, Softsign, Tan, ThresholdedRelu

Python Changes
- Support for numpy >= 2.0

Bug fixes and infrastructure improvements
- Fix Check URLs errors (onnx/onnx#5972)
- Use CMAKE_PREFIX_PATH in finding libprotobuf (onnx/onnx#5975)
- Bump main VERSION_NUMBER to 1.17.0 (onnx/onnx#5968)
- Fix source and pip tar.gz builds on s390x systems (onnx/onnx#5984)
- Fix unique_name (onnx/onnx#5992)
- Fix SegFault bug in shape inference (onnx/onnx#5990)
- Fix onnx.compose when connecting subgraphs (onnx/onnx#5991)
- Fix conversion from split 11 to split 18 (onnx/onnx#6020)
- Update error messages for NegativeLogLikelihoodLoss inference function (onnx/onnx#6021)
- Generalize input/output number check in shape inference (onnx/onnx#6005)
- Replace rank inference with shape inference for Einsum op (onnx/onnx#6010)
- Build-from-source instruction with latest cmake change (onnx/onnx#6038)
- Handle OneHot's depth value during shape inference (onnx/onnx#5963)
- Not to install cmake in pyproject.toml on Windows (onnx/onnx#6045)
- Fix a skipped shape infer code (onnx/onnx#6049)
- Include the ".onnxtext" extension in supported serialization format (onnx/onnx#6051)
- Allow ReferenceEvaluator to return intermediate results (onnx/onnx#6066)
- Fix 1 typo in numpy_helper.py (onnx/onnx#6041)
- Remove benchmarking code (onnx/onnx#6076)
- Prevent crash on import after GCC 8 builds (onnx/onnx#6048)
- Check graph outputs are defined (onnx/onnx#6083)
- Enable additional ruff rules (onnx/onnx#6032)
- Add missing shape inference check for DequantizeLinear (onnx/onnx#6080)
- Add bfloat16 to all relevant ops (onnx/onnx#6099)
- fix(ci): install python dependencies with --only-binary :all: in manylinux (onnx/onnx#6120)
- fix: install google-re2 with --only-binary option (onnx/onnx#6129)
- Specify axis parameter for DequantizeLinear when input rank is 1 (onnx/onnx#6095)
- Pin onnxruntime to 1.17.3 for release CIs (onnx/onnx#6143)
- Fix INT4 TensorProto byte size is 5x larger than expected with negative values (onnx/onnx#6161)
- Mitigate tarball directory traversal risks (onnx/onnx#6164)
- Fix reference implementation for ScatterND with 4D tensors (onnx/onnx#6174)
- Addition of group > 1 in test and in backend for ConvTranspose (onnx/onnx#6175)
- Support for bfloat16 for binary, unary operators in reference implementation (onnx/onnx#6166)
- Refactor windows workflow to work on standard windows (onnx/onnx#6190)
- Fix a few crashes while running shape inference (onnx/onnx#6195)
- Update onnx to work with numpy>=2.0 (onnx/onnx#6196)
- Use sets to improve performance of dfs search (onnx/onnx#6213)

(... truncated)
Commits:
- `b8baa84` Set version 1.17.0 for official release (onnx/onnx#6405)
- `6d77b80` [Cherry-Pick] Fix main url checks (onnx/onnx#6312) (onnx/onnx#6327)
- `174938d` [Cherry-Pick] Fix protobuf pkg 5.28.0 failing on Windows (onnx/onnx#6342) (onnx/onnx#6347)
- `f18d593` [Cherry-Pick] Remove unused variables (onnx/onnx#6303) (onnx/onnx#6324)
- `c588905` Set version in rel-1.17.0 to 1.17.0rc1 (onnx/onnx#6317)
- `4392c2c` Prepare for rel-1.17.0 (onnx/onnx#6281)
- `cb54169` Update ort filter to 1.20.0 to skip tests known to fail with ort 1.19.0 (onnx/onnx#6306)
- `99e1fd3` Bump reviewdog/action-misspell from 1.21.0 to 1.23.0 (onnx/onnx#6268)
- `1920565` Bump ossf/scorecard-action from 2.3.3 to 2.4.0 (onnx/onnx#6273)
- `2e8f228` Bump mypy from 1.10.1 to 1.11.1 (onnx/onnx#6275)
- Additional commits viewable in the [compare view](https://github.com/onnx/onnx/compare/v1.16.1...v1.17.0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138719
Approved by: https://github.com/ezyang

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-25 03:53:25 +00:00
94e341c6a3 [user triton] fix codegen for tl.constexpr globals (#138757)
Fixes #138509

tl.constexpr globals would be codegen-ed as `constexpr()` instead of `tl.constexpr()` if they were un-annotated. This fixes the issue (and adds a test). The correct handling was already added but the corrected string was not being used in the un-annotated branch.
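
For illustration, a hedged sketch of the affected pattern (assumes a CUDA device and that Triton folds plain-int globals captured by a `@triton.jit` kernel into constexpr values, which is the un-annotated case this fix targets):

```python
import torch
import triton
import triton.language as tl

BLOCK = 128  # un-annotated module-level constant captured by the kernel

@triton.jit
def add_one_kernel(x_ptr, out_ptr, n_elements):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + 1, mask=mask)

@torch.compile
def add_one(x):
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), BLOCK),)
    add_one_kernel[grid](x, out, x.numel())
    return out

add_one(torch.randn(1024, device="cuda"))
```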

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138757
Approved by: https://github.com/oulgen
2024-10-25 03:00:42 +00:00
36c6ad71ba [tlparse] Add dynamo_graph_break_reason logging to trace_structured (#138778)
A common challenge during torch.compile enablement is answering the user's question: "where is the graph break?". This PR makes that easier to answer by surfacing graph breaks and their corresponding user stack trace / compiler stack trace in a direct link, e.g. `0_0_0/dynamo_graph_break_reason_0.txt`, from the tlparse index.html.

![image](https://github.com/user-attachments/assets/79cd43f5-af14-4d08-9d5b-cb47d8203851)

![image](https://github.com/user-attachments/assets/23233ee2-0d56-4526-bf9a-d22c337c4d18)
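
For illustration, a hedged minimal way to hit a graph break and produce the structured trace that tlparse renders (the explicit `graph_break()` call is just a stand-in for a real unsupported construct):

```python
import torch

def fn(x):
    torch._dynamo.graph_break()  # stand-in for a real unsupported construct
    return x + 1

torch.compile(fn)(torch.randn(4))
# Run the script with e.g. TORCH_TRACE=/tmp/trace_dir and open that directory
# with tlparse to see the dynamo_graph_break_reason_*.txt artifacts.
```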

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138778
Approved by: https://github.com/ezyang
2024-10-25 02:00:04 +00:00
9425c0767d Fix free symbol handling in FlexAttention (#138794)
Fixes https://github.com/pytorch/pytorch/issues/136196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138794
Approved by: https://github.com/Skylion007
ghstack dependencies: #138733
2024-10-25 01:20:42 +00:00
f737e3fe2f [inductor] Fix ReinterpretView call in TMADescriptor IR (#138759)
As a result of #137768, `ReinterpretView` call in the `TMADescriptor`
has become invalid. This leads to some TMA tests breaking in
test_triton_kernels.py. In this PR, we fix this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138759
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-10-25 00:45:44 +00:00
ed9169df98 Removed the typing information for already deleted ProcessGroupCudaP2P (#138753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138753
Approved by: https://github.com/weifengpy
2024-10-25 00:32:07 +00:00
2f4af0f4e6 [Profiler] Disable Dynamo-Sensitive Profiler Tests (#138762)
Summary: During compilation, a profiler context gets ignored, so we should temporarily turn off the tests that are failing due to dynamo. Once profiler integration with dynamo is introduced, we can reintroduce these tests.

Test Plan: Make sure CI is passing again

Differential Revision: D64867447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138762
Approved by: https://github.com/davidberard98
2024-10-25 00:25:49 +00:00
1d98a526dd preserve signatures with multiple calls + buffer mutations (#138669)
As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures).

This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead.

Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition.
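
As a rough illustration (not from this PR's test suite; the module names are made up and the exact export entry point here is assumed), the kind of program this enables:

```python
import torch

class Sub(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("step", torch.zeros(1))

    def forward(self, x):
        self.step.add_(1)           # buffer mutation
        return x + self.step

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sub = Sub()

    def forward(self, x):
        return self.sub(self.sub(x))  # the same submodule is called twice

ep = torch.export.export(M(), (torch.randn(3),),
                         preserve_module_call_signature=("sub",))
unflat = torch.export.unflatten(ep)   # calls into `sub` keep their signatures
out = unflat(torch.randn(3))
```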

Differential Revision: D64806175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669
Approved by: https://github.com/tugsbayasgalan
2024-10-25 00:13:25 +00:00
4c91481656 [c10d] allow sub group to be eagerly inited even if default one is not (#138665)
Summary:
Currently, eager mode is applied either to all PGs or to none of them.
There are cases where we don't want to initialize the comms for the default
PG, but we still want to initialize the comms for a sub-PG. Now that a
device_id can be passed to new_group, we can support this case.
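
A hedged sketch of the resulting usage (assumes a multi-GPU NCCL job launched with torchrun; the `device_id` argument to `new_group` is the new piece described above):

```python
import os
import torch
import torch.distributed as dist

# Default PG stays lazily initialized: no device_id passed here.
dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank)

# The sub-group gets its NCCL communicator eagerly, even though the default PG did not.
subgroup = dist.new_group(ranks=[0, 1], device_id=device)
```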
Test Plan:
newly added UT

Resolves https://github.com/pytorch/pytorch/issues/137018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665
Approved by: https://github.com/kwen2501
ghstack dependencies: #138781
2024-10-24 23:51:28 +00:00
277b32c930 fix unflatten training ir test suffix (#138840)
Test Plan: none

Differential Revision: D64917214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138840
Approved by: https://github.com/zhxchen17
2024-10-24 23:42:54 +00:00
425ce2a7ee [c10d] use a promise to delay watchdog shutdown (#138828)
Summary:
We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread too fast before it has time to write out the Flight Recorder logs.
This change:
1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes.
2. Use a promise between the watchdog thread and the heartbeat monitor thread to delay, at most, one minute, giving Flight Recorder time to write out its log on timeout.

Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives heartbeat thread time to write out the logs.

In the logs we see:
```
[trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish.
```

Differential Revision: D64857928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138828
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
2024-10-24 23:42:29 +00:00
751987eed1 [pt2] improve error logs for torch.cond and aoti package (#138647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138647
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2024-10-24 23:38:07 +00:00
3e4ba18eb5 [aoti] fix typo in codegen_dynamic_scalar (#138760)
Summary: appears to be a typo

Test Plan: ci

Differential Revision: D64867271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138760
Approved by: https://github.com/ezyang
2024-10-24 23:16:30 +00:00
09848c892a [aot_compile] propagate ShapeEnv during lowering (#138362)
We found that during `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs through the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused.
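
For context, a hedged sketch of that lowering path with a dynamic dimension so a ShapeEnv is actually exercised (the module here is illustrative, not from this diff):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

batch = torch.export.Dim("batch")
ep = torch.export.export(M(), (torch.randn(4, 8),),
                         dynamic_shapes={"x": {0: batch}})
# Lowering the exported module; the ShapeEnv attached to the example values
# should be reused rather than re-created along the way.
so_path = torch._inductor.aot_compile(ep.module(), (torch.randn(4, 8),))
```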

Differential Revision: D64613290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
2024-10-24 22:22:14 +00:00
51f6b946ae [torchbind] Add generic __deepcopy__ method (#137613)
Summary: Added a generic `__deepcopy__` method which will use the torchbind object's existing `__getattr__` and `__setattr__` to copy the torchbind object. This will later be used in [D64124825](https://www.internalfb.com/diff/D64124825)

Differential Revision: D64124826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137613
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-10-24 22:14:55 +00:00
282e6383c1 Add inductor cache metrics (#138603)
Each inductor event should have exactly one hit, miss, bypass etc. Add it to the inductor compile event.

Add triton_compile as a compiler phase with `dynamo_timed`. This way, we get PT2 Compile Event Logs for triton as well.

Here's what triton events look like:  {F1941513932}
And this on a cache hit (since we still redo this work):
 {F1941514350}

Inductor cache info:
 {F1941528530}

Differential Revision: [D64703392](https://our.internmc.facebook.com/intern/diff/D64703392/)

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138603
Approved by: https://github.com/oulgen
2024-10-24 22:09:34 +00:00
e78a3e260b [export] Add serdes_non_strict to tests (#138662)
Summary: We expand the tests to cover serdes_non_strict. Currently failing tests are skipped.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _serdes_non_strict
```

Differential Revision: D64709285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138662
Approved by: https://github.com/avikchaudhuri
2024-10-24 21:35:32 +00:00
500b2bc781 Have as_tensor always return a float64 tensor in dynamo (#138598)
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these code paths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan being the global CI. These code paths are all tested by existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138598
Approved by: https://github.com/ezyang
ghstack dependencies: #138595
2024-10-24 20:50:28 +00:00
5b50b0a9bc remove dead code (#138690)
Fixes issue-138673: [issue](https://github.com/pytorch/pytorch/issues/138673)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138690
Approved by: https://github.com/Aidyn-A, https://github.com/colesbury
2024-10-24 20:29:24 +00:00
10a34dcd57 [PyTorch] Fix out-of-bounds array access in atomic_add_vec (#138744)
There is no guarantee that `len` here is enough for a full vector. This was causing at least one test failure on https://github.com/pytorch/pytorch/pull/137426.

Differential Revision: [D64857786](https://our.internmc.facebook.com/intern/diff/D64857786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138744
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716
2024-10-24 19:37:12 +00:00
0af7632c10 [PyTorch] Fix ASAN failures for vec_test_all_types Cast test (#138716)
The size of the destination array was too small.

Differential Revision: [D64843491](https://our.internmc.facebook.com/intern/diff/D64843491/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138716
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655
2024-10-24 19:37:12 +00:00
cbafe1e7f3 [PyTorch] Unbreak VectorizedN fmadd/fmsub/clamp (#138655)
These are ternary ops, not binary ops.

Differential Revision: [D64794253](https://our.internmc.facebook.com/intern/diff/D64794253/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138655
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542
2024-10-24 19:37:02 +00:00
ead5738ff2 [PyTorch] Fix inductor bug with unrolled vectorized prod (#138542)
This issue is one of two inductor bugs blocking the landing of #137426. It turned out to be simple.

Differential Revision: [D64734116](https://our.internmc.facebook.com/intern/diff/D64734116/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138542
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-10-24 19:36:51 +00:00
6aa673377b [PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where (#138486)
In this case, it looks like we expect the body to be a VecMask (unify_mask_base_type is called by where()), but we didn't make it a VecMask. Now we do.

Differential Revision: [D64702918](https://our.internmc.facebook.com/intern/diff/D64702918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138486
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
2024-10-24 19:36:41 +00:00
239a21f37e [Inductor] don't set XBLOCK larger than xnumel (#138730)
When an fp8 dtype is involved, Inductor may set min_elem_per_thread to a positive value. This forces XBLOCK to be increased even for a small xnumel (e.g. 1). Inductor then reports an error later when sanity-checking the Triton config.

The simple fix here is to just not let XBLOCK be larger than xnumel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138730
Approved by: https://github.com/Chillee
ghstack dependencies: #136782
2024-10-24 18:31:10 +00:00
e7f1e306df Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763)"
This reverts commit 362ca54f03f9bb72ba7633ed580fb788b1a8dea9.

Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/wdvr due to this change is breaking our prod training pipeline (verified with bisect) by increasing memory consumption 4x and causing OOM ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2435962833))
2024-10-24 17:46:09 +00:00
8197e4c70d Revert "[sparse] add search for optimal alg_id to torch.compile (#137427)"
This reverts commit 39bfba3f561e3125ce035de0bf90c8c7bcccd3ce.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/jcaip due to this PR breaks AO tests ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2435906592))
2024-10-24 17:27:06 +00:00
5ea6777861 [subclass] Unwrap_tensor_subclasses micro optimization (#138498)
unwrap_tensor_subclasses -> get_plain_tensors

This is used at runtime. For small models this overhead is noticeable compared to a small compiled kernel.

1/ Remove asserts from the runtime path.
2/ Avoid list creation by using an optional output list to append arguments to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138498
Approved by: https://github.com/bdhirsh
2024-10-24 16:54:54 +00:00
fe458eef80 [c10d] fix a logic of using ncclCommSplit (#138781)
Summary:
Currently, whether split should be used depends on the size of the subgroup.
It's possible that the default PG is not eagerly initialized yet, but split is still
called.

This PR fixes the issue by removing split's dependency on the subgroup size.
Test Plan:
Modified UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138781
Approved by: https://github.com/kwen2501
2024-10-24 16:16:35 +00:00
b021486405 Enable Windows Arm64 (#133088)
This PR enables Pytorch for Windows on Arm64 - CPU only.
Currently, there aren't any checks in place to build and test for Windows on Arm64, but we're working to implement those as soon as possible.
We recommend using [Arm Performance Libraries (APL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries) as a BLAS option, which is introduced in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133088
Approved by: https://github.com/malfet

Co-authored-by: cristian panaite <panaite.cristian2000@gmail.com>
Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2024-10-24 16:10:44 +00:00
eqy
f7bb11dcc2 [cuDNN][cuDNN Frontend] Check in test for previously broken dBias check (#138725)
see https://github.com/pytorch/pytorch/issues/137347, let's try to land before https://github.com/pytorch/pytorch/pull/138709

CC @malfet @drisspg @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138725
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-10-24 15:33:58 +00:00
8f62832189 c10::nullopt -> std::nullopt (#138701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138701
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-10-24 15:03:32 +00:00
7e62ac51a1 [pt2] [testing] Skip inductor_freezing - test_cpp_wrapper_cuda internally (#138366)
Summary: It's been failing CI since probably forever; skip for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138366
Approved by: https://github.com/eellison
2024-10-24 14:40:13 +00:00
5c88a9f6c0 Assume that indices are non-negative in _unsafe_masked_index (#137315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137315
Approved by: https://github.com/eellison
2024-10-24 12:39:31 +00:00
0d9fb51028 Fix lru_cache where config is used (#134235)
Ensure that any use of functools.lru_cache does not prevent config from being changed after the function has already run.
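
A hedged, self-contained illustration of the hazard being guarded against (the `config` class below is a stand-in, not Inductor's actual config module):

```python
import functools

class config:                 # stand-in for a mutable config module
    use_fast_path = False

@functools.lru_cache(None)
def fast_path_enabled():
    return config.use_fast_path

print(fast_path_enabled())    # False, and now cached
config.use_fast_path = True
print(fast_path_enabled())    # still False: the stale cached value wins
```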

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134235
Approved by: https://github.com/masnesral
2024-10-24 10:43:34 +00:00
e7d4de0e59 Eliminate C10_TYPENAME_CONSTEXPR (#138702)
Test Plan: Sandcastle

Differential Revision: D64833560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138702
Approved by: https://github.com/malfet
2024-10-24 10:21:01 +00:00
0efa590d43 [CI] Fix XPU CI failure (#138548)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.

# Solution
1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170
2. UT in `test/inductor/test_pattern_matcher.py` is introduced by https://github.com/pytorch/pytorch/pull/138089, we will skip this UT due to the unsupported feature `max_autotune_gemm_backends:Triton`.
3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
5. CUDA-biased code was introduced by https://github.com/pytorch/pytorch/issues/138472; we just generalize it to `GPU_TYPE`.

# Additional Context
> Why update torch-xpu-ops commit pin here?

We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364).

> What does the feature of torch-xpu-ops update?

1. Add some foreach ops, like unary ops and `foreach_clamp_max`, etc.
2. Add forward and backward for some pooling ops, like `avg_pool3d` and `max_pool3d`.
3. Add some other ops, like `log_normal_`, `index_copy`, and `mode`, etc.
4. Fix a build failure related to `C10_UNUSED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-24 07:56:26 +00:00
dbf0fa811a Remove C10_HOST_CONSTEXPR_EXCEPT_WIN_CUDA and CONSTEXPR_EXCEPT_WIN_CUDA (#138479)
BC linter suppressed due to removal of `tools/linter/adapters/constexpr_linter.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138479
Approved by: https://github.com/eqy, https://github.com/malfet
2024-10-24 07:51:05 +00:00
96b30dcb25 [Windows][cpu] mkl use mimalloc as allocator on Windows (#138419)
We did a lot of optimization for PyTorch on Windows and made good progress. But some models still have a performance gap between PyTorch on Windows and PyTorch on Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that `ResNet50` is a typical case.

Let's focus on `ResNet50` and collect the profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```
We found that the major kernel consuming CPU resources is `aten::mkldnn_convolution`, which is dispatched to `MKLDNN`.
Actually, we had already optimized memory allocation by integrating mimalloc into the PyTorch c10 module. It helps PyTorch on Windows a lot, but it does not cover `MKL`'s and `MKLDNN`'s intermediate temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing `MKL`'s and `MKLDNN`'s intermediate temporary memory.

So I discussed this with the Intel MKL team and got a method to register a high-performance memory allocation API with MKL, which helps MKL boost memory performance. Please check the online documentation: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add a CMake option, `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in c10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON`.

For `oneDNN`, this is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-24 05:29:47 +00:00
a94c501b84 Fixed max-autotune in FlexAttention to reset kernel options appropriately (#138733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138733
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng
2024-10-24 05:18:09 +00:00
cyy
2bcfbf2505 [Distributed] [17/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138465)
Follows  #137404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138465
Approved by: https://github.com/ezyang
2024-10-24 04:58:49 +00:00
cyy
53e356a1c0 [2/N] Enable cppcoreguidelines-special-member-functions (#138670)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138670
Approved by: https://github.com/sraikund16
2024-10-24 04:35:18 +00:00
cfdf658a91 [dynamo][modules] Support overridden __call__ on nn modules (#138619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138619
Approved by: https://github.com/williamwen42
ghstack dependencies: #138657
2024-10-24 03:49:26 +00:00
b1acd0978e [dynamo] Support range_iterator as a function input (#138657)
Fixes https://github.com/pytorch/pytorch/issues/138654
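
A hedged minimal sketch of the newly supported call pattern:

```python
import torch

@torch.compile
def take_next(it, x):
    # the iterator argument is a live range_iterator, now handled by dynamo
    return x + next(it)

print(take_next(iter(range(10)), torch.randn(3)))
```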

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138657
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-24 03:49:26 +00:00
e5c3d7ab77 [ROCm] Improve performance of reductions on 1D and 2D tensors. (#137737)
This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types:
- 1D tensors
- 2D tensors when reducing along dimension 0
- 2D tensors when reducing along dimension 1

Runtime reduction between 0 and 75% depending on tensor shape.

The patch uses the maximum number of threads per CU and the number of CUs itself to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137737
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell
2024-10-24 03:41:16 +00:00
d8f22a1141 [c10d] Reorder GIL checker and c++ stack trace print with comments (#138734)
We found one case where a GIL deadlock happens and then FR times out. I am wondering if we can do the GIL check before the C++ stack trace print, which can lead to a hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138734
Approved by: https://github.com/c-p-i-o
2024-10-24 02:21:37 +00:00
0b9320b7c5 fx_graph_cache: Remove custom amd JK (#137501)
This split in JKs was never actually used (we just set both JKs to the same values, except when we accidentally didn't, being humans who make mistakes). This simplifies the overall JK structure and, eventually, will let us delete the duplicate JK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137501
Approved by: https://github.com/oulgen
2024-10-24 01:30:39 +00:00
32a3dbc645 [Pipelining] Free memory usage earlier in last stage (#138504)
This fix is similar to the one in #138119, except this is an edge case for the last stage. For the last stage we perform backward on the `loss`, which we detached in the previous PR. However, we also keep the `stage_outputs` alive because we return all the output chunks in `merge_output_chunks()` after the step is over. This still keeps the autograd graph alive, so detaching these tensors frees the memory earlier.

pre-fix:
<img width="1780" alt="image" src="https://github.com/user-attachments/assets/bb78bde7-fd5c-4eba-bfc9-f0359e20bbab">

post-fix:
<img width="1788" alt="image" src="https://github.com/user-attachments/assets/a26102d9-9db2-4fc8-946c-336b8430657c">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138504
Approved by: https://github.com/wconstab
ghstack dependencies: #138119
2024-10-24 00:44:03 +00:00
8945309c08 [Pipelining] fix extra memory usage in zero bubble (#138119)
Full debugging details in here: https://docs.google.com/document/d/1Pe_E0KWAfsJ6MCvKZ5aR28rTXX-rYLg13XxwXd6AALw/edit?usp=sharing

In zero bubble, we have two methods, `stage_backward_input` and `stage_backward_weight`. During `stage_backward_input` we compute the gradients of the input with respect to the stage outputs and also retain the autograd graph (unlike 1F1B, where `retain_graph=False`). The output / loss was still being retained across the next schedule step() because we return the loss to the user and use the output in the next step. To allow autograd to free the variables in the graph, we need to detach the output/loss once autograd no longer needs it.

Pre-fix:
<img width="1021" alt="image" src="https://github.com/user-attachments/assets/6c8bf469-32b1-4dac-85ff-b97991f9f0e3">

Post-fix:
<img width="1039" alt="image" src="https://github.com/user-attachments/assets/a1875038-e80b-4dd4-84f2-38727d7792dc">

without AC (7B model on titan):
10% memory improvement

with AC (7B model on titan)
50% memory improvement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138119
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-10-24 00:44:03 +00:00
889717aabd [CI/CD] Disable split build (#138752)
See https://github.com/pytorch/pytorch/issues/138750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138752
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-10-23 22:38:30 +00:00
1b31248933 [EZ] Fix typo in test_mps.py (#138738)
s/emedding_weight/embedding_weight/

Stolen from 074766d9b4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738
Approved by: https://github.com/atalman
2024-10-23 22:15:35 +00:00
c92459488b Fix test on windows (#138641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138641
Approved by: https://github.com/huydhn
2024-10-23 21:53:32 +00:00
dd4dd85210 [hierarchical-compilation][inductor] Support invoke_subgraph HOP (#138031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138031
Approved by: https://github.com/eellison
ghstack dependencies: #137538, #138036, #137965
2024-10-23 21:32:14 +00:00
7622ede3cd Add dump_launch_params config in triton/inductor (#137143)
Summary: Moves the checking of TORCHINDUCTOR_DUMP_LAUNCH_PARAMS into the config module to pull it out of the critical path.

Test Plan: Existing unit tests cover this env variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137143
Approved by: https://github.com/eellison
2024-10-23 21:20:46 +00:00
9eadd7434e Refactor: Move _nested_int_aware_sort top level (#138693)
I need to use it from some other places later in the PR stack

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138693
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-23 21:15:05 +00:00
9b77d3109b [export] fix test_unbacked_bindings_for_divisible_u_symint (#138607)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138607
Approved by: https://github.com/angelayi
2024-10-23 21:10:05 +00:00
dbd6ada8c3 Clean up a c10::optional and fix documentation (#138700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138700
Approved by: https://github.com/Skylion007
2024-10-23 20:42:28 +00:00
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
cd9c6e9408 Do not run CI on forks (#138714)
Add `if: github.repository_owner == 'pytorch'` for some jobs that were missing it

Fixes #138564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138714
Approved by: https://github.com/huydhn, https://github.com/kit1980
2024-10-23 18:23:05 +00:00
ed313a5ca2 Introduce torch.sym_add, variadic add (#138660)
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
main change is
```
if min is None and max is None:
    torch._check_is_size(size)
    return
```

Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
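
A hedged, pure-sympy illustration of the cost argument (this is not PyTorch's internal code; it only demonstrates why one variadic node beats a chain of binary adds):

```python
import sympy

xs = sympy.symbols("x0:200")

# Chained binary adds: each step constructs a new Add over all terms so far,
# so the total work grows quadratically with the number of terms.
acc = sympy.Integer(0)
for x in xs:
    acc = acc + x

# One variadic construction: a single linear pass over the arguments.
total = sympy.Add(*xs)

assert acc == total
```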

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
2024-10-23 17:42:41 +00:00
72ea7ba89f Generate slice.Tensor view operations instead of as_strided when split is used in the original program. (#137225)
test_recompile asserts that the changes do not add more recompilation, by comparing with the eager backend.
The reason for this change is that slice can be lowered in a more efficient way.
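
A hedged sketch of the kind of program affected (the backend choice here is just for inspecting the lowering, not prescribed by the PR):

```python
import torch

@torch.compile(backend="aot_eager")
def f(x):
    # split produces views; per this PR they lower to slice.Tensor rather than as_strided
    a, b = torch.split(x, 2, dim=0)
    return a.sum() + b.mean()

print(f(torch.randn(4, 3)))
```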

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137225
Approved by: https://github.com/zou3519
2024-10-23 17:42:16 +00:00
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
c272526ea5 [SJD] [RFC] force setting last progress time (#138615)
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization.

The workaround is to update the healthcheck time when calling `get_last_progress_time`

Test Plan:

Logs show that the progress time value is being changed despite the client not being set.

Behavior when watchdog is enabled with SJD config is left unchanged

Differential Revision: D64733766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
2024-10-23 15:29:00 +00:00
cdfe1bffd1 Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527)"
This reverts commit 8fbf866904661b16cba4c799af81121557ba9da8.

Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/jeanschmidt due to Seems to have introduce regressions on main, pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.g4dn.12xlarge.nvidia.gpu) checking if revert will do ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2432479338))
2024-10-23 14:49:49 +00:00
2f007e5de5 Make trace log dir persist through multiple set_logs() calls (#137793)
Summary: Currently, calling `torch._logging.set_logs()` resets the log directory, leading to multiple tlparse outputs. This change prevents the dir from resetting after the first call.

Reviewed By: ezyang

Differential Revision: D64118047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793
Approved by: https://github.com/ezyang
2024-10-23 14:23:03 +00:00
ecf2240243 [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-10-23 14:13:49 +00:00
75c6787a16 [CI] Introduces experiment awsa100 to inductor-perf-compare.yml workflow using _runner-determinator.yml (#138204)
Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.

It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.

Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate via experimentation from `linux.gcp.a100`. After successfully experimenting with those instances, we will remove those labels and update the workflows to use `linux.aws.a100` and decommission the gcp fleet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-10-23 13:47:26 +00:00
04103f6ae9 Eliminate c10 string_utils (#138499)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138499
Approved by: https://github.com/swolchok
2024-10-23 13:40:19 +00:00
c2d26418c3 [Quant][Inductor] expand quantization conv-binary(-unary) pattern fusion inside inductor (#138051)
### Summary
Expand quantization conv-binary(-unary) pattern fusion inside inductor to support the following two patterns:
Pattern 1:
```
    Conv(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```
Pattern 2:
```
    extra input   Conv(X)
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```
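
In eager PyTorch terms, the matched computation looks roughly like the following (a minimal sketch; the module, shapes, and names are illustrative, and the quantization/lowering details are omitted):

```python
import torch
import torch.nn as nn

class ConvAddReLU(nn.Module):
    """Pattern 1: Add(Conv(X), extra_input), optionally followed by ReLU."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x, extra):
        y = self.conv(x) + extra  # Pattern 2 is the same with the Add operands swapped
        return torch.relu(y)      # the ReLU is optional in the matched pattern

m = ConvAddReLU()
out = m(torch.randn(1, 3, 16, 16), torch.randn(1, 8, 16, 16))
```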

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138051
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-10-23 13:12:17 +00:00
2f1842fa83 [CD] fix xpu support packages version (#138189)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138189
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/atalman
2024-10-23 12:25:43 +00:00
8fbf866904 [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast.
2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
2024-10-23 08:51:54 +00:00
2d7e586c13 Fixed dead lock in execution trace (#136892)
Summary:
This diff fixes a deadlock issue in execution trace. ExecutionTraceObserver takes a lock in recordOperatorStart and onFunctionExit. However, inside these two functions the input/output values are evaluated, which can trigger the Python GIL in some use cases. In this case, the lock order is ET lock -> GIL.

One of the ads applications takes the GIL first, then calls all-gather to collect some metrics from all ranks. When ET is on, the all-gather is captured by the ET observer. In this case, the lock order is GIL -> ET lock.

That is why the deadlock happens. To fix it, I changed the scope of the ET lock so that the input/output evaluation is no longer inside it.
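
The lock-ordering problem can be illustrated with a plain-Python sketch (the two locks below merely stand in for the GIL and the ET lock; this is not the observer code itself):

```python
import threading
import time

gil = threading.Lock()      # stands in for the Python GIL
et_lock = threading.Lock()  # stands in for the ExecutionTraceObserver lock

def observer_path():
    # ET lock -> GIL (e.g. recordOperatorStart evaluating inputs)
    with et_lock:
        time.sleep(0.1)
        with gil:
            pass

def app_path():
    # GIL -> ET lock (e.g. the application holds the GIL, then its all-gather is captured by ET)
    with gil:
        time.sleep(0.1)
        with et_lock:
            pass

t1 = threading.Thread(target=observer_path, daemon=True)
t2 = threading.Thread(target=app_path, daemon=True)
t1.start(); t2.start()
t1.join(timeout=2); t2.join(timeout=2)
print("deadlocked:", t1.is_alive() and t2.is_alive())  # True: inconsistent lock ordering
```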

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda

Differential Revision: D63556608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892
Approved by: https://github.com/aaronenyeshi
2024-10-23 07:53:56 +00:00
cab5f54dee [ONNX] Fix sequence handling in graph building (#138656)
Prior to this PR, op.Concat was called without its required attribute (axis), and the `val` and `arg` handling appears to be wrongly coded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138656
Approved by: https://github.com/justinchuby
2024-10-23 07:47:58 +00:00
5402677021 add CUDA 12.6 to conda docker image (#138417)
Adds cuda 12.6 to common installation script.
Adds cuda 12.6 to conda docker image build matrix.

fixes #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138417
Approved by: https://github.com/cyyever, https://github.com/atalman
2024-10-23 07:30:51 +00:00
5ceef8c470 Add support for SymFloats in split_module fx pass (#138599)
As discussed with @ezyang, this set of diffs are extracting fixes to problems discovered to flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
2024-10-23 06:56:13 +00:00
96c86758e2 Support conditionals on sym node variables in the __bool__ and __len__ case (#138595)
As discussed with @ezyang, this set of diffs are extracting fixes to problems discovered to flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595
Approved by: https://github.com/ezyang
2024-10-23 06:44:09 +00:00
72dde6e84b [ONNX] Avoid optimize onnx_dynamo-fallback (#138265)
Prior to this PR, when a model failed to export, we fell back to the legacy TorchScript exporter. However, we didn't stop once it was exported with the TorchScript exporter: an optimization pass was still applied to the graph.

Ideally the optimization would also boost the performance of models exported with the legacy TorchScript exporter, but for now, for benchmarking purposes and to keep the fallback guarantee to users simple, we should only return the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265
Approved by: https://github.com/xadupre, https://github.com/justinchuby
2024-10-23 04:13:32 +00:00
bb65c9b883 [PyTorch] Classify Unsupported mutated Dynamic Shapes as User Error (#137054)
Summary: We don't need an assert for unsupported dynamic-shape inputs; this removes the assert and raises a user exception instead.

Differential Revision: D63661569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137054
Approved by: https://github.com/bdhirsh
2024-10-23 03:15:37 +00:00
cyy
fbd14315f9 Update ruff to 0.7.0 (#138597)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138597
Approved by: https://github.com/ezyang
2024-10-23 03:00:30 +00:00
06b5330674 [easy] Log subproc pool creation (#138642)
Summary: Request from internal to log subproc pool creation

Test Plan:
```
$ TORCH_LOGS=+torch._inductor.async_compile python ~/add.py
I1022 14:12:41.915000 444394 torch/_inductor/async_compile.py:165] Creating subprocess pool with 32 workers
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138642
Approved by: https://github.com/eellison
2024-10-23 02:41:42 +00:00
cyy
86cca3fb05 [1/N] Don't skip ASAN on some tests (#138571)
Clang15's ASAN is new enough so that it's possible to re-evaluate the disabled ASAN  tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138571
Approved by: https://github.com/ezyang
2024-10-23 02:38:45 +00:00
d437df342b [tests] fix broken tests caused by AotEagerAndRecordGraphs typo (#138492)
Summary:
Name change happened in https://github.com/pytorch/pytorch/pull/138231

AttributeError: module 'torch._dynamo.testing' has no attribute 'AOTEagerAndRecordGraphs'. Did you mean: 'AotEagerAndRecordGraphs'?

Test Plan: ci

Differential Revision: D64704686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138492
Approved by: https://github.com/aakhundov
2024-10-23 02:25:21 +00:00
fee2f331ce Update torchbench.txt (#138569)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138569
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-10-23 01:42:25 +00:00
f2ebf6d94a [PGNCCL] Ensure comm is ready before all accesses (#138384)
Previously we only wait for comm to become ready after its initialization.
That's not enough. There are other NCCL APIs that can cause the comm to be InProgress, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we just ensure comm is ready every "next time" we need to access ncclComm.
The place to add such gate keeper is `getNcclComm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
2024-10-23 01:36:58 +00:00
37149d032c Fix .to(cpu) for Storage (#138011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138011
Approved by: https://github.com/albanD
2024-10-23 01:31:48 +00:00
555bddbef7 [AOTI][refactor] Move use_minimal_arrayref_interface logic (#138250)
Summary: Move use_minimal_arrayref_interface specific logic from CppWrapperCpu to CppWrapperCpuArrayRef. This is a copy-on-write style refactor, to simplify the default AOTI generated code.

Differential Revision: [D64598715](https://our.internmc.facebook.com/intern/diff/D64598715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138250
Approved by: https://github.com/chenyang78
ghstack dependencies: #138544, #138379
2024-10-23 01:00:34 +00:00
2cee5a39ad [AOTI] Fix check_model_with_multiple_inputs in test_aot_inductor (#138379)
Summary: Add missing use_minimal_arrayref_interface setting to check_model_with_multiple_inputs.

Differential Revision: [D64635211](https://our.internmc.facebook.com/intern/diff/D64635211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138379
Approved by: https://github.com/hl475
ghstack dependencies: #138544
2024-10-23 00:54:29 +00:00
d428d81c7f Remove some pre-cpp17 stuff (#138410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138410
Approved by: https://github.com/Skylion007
2024-10-23 00:38:03 +00:00
f4b3813989 Wrap autograd and autocast ops in training IR (#138516)
Differential Revision: [D64732361](https://our.internmc.facebook.com/intern/diff/D64732361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138516
Approved by: https://github.com/yushangdi
ghstack dependencies: #138261
2024-10-23 00:37:54 +00:00
9f7b987087 Revert "[Inductor] New Triton Attrs Descriptor Fixups (#138390)"
This reverts commit 215999452eb5517213b3a31f72eb9a7e843d12a0.

Reverted https://github.com/pytorch/pytorch/pull/138390 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it still has another lint error ([comment](https://github.com/pytorch/pytorch/pull/138390#issuecomment-2430566004))
2024-10-23 00:37:28 +00:00
69f18587d6 Move test_serialize to training IR (#138261)
Differential Revision: [D64572253](https://our.internmc.facebook.com/intern/diff/D64572253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138261
Approved by: https://github.com/yushangdi
2024-10-23 00:32:32 +00:00
662d07e93e Remove parallel_and and parallel_or (#138135)
Not used, suggested by @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138135
Approved by: https://github.com/ezyang
2024-10-23 00:22:22 +00:00
cyy
38d3c27849 [1/N] Enable cppcoreguidelines-special-member-functions (#137405)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137405
Approved by: https://github.com/ezyang
2024-10-23 00:16:53 +00:00
7e951c1675 [EZ][DTensor] Update DTensor readme to use the new import path (#138625)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138625
Approved by: https://github.com/XilunWu
2024-10-23 00:08:36 +00:00
3441ea7d74 [dynamo] reset compiler stance after test (#138277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138277
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-23 00:07:33 +00:00
a825667670 [executorch hash update] update the pinned executorch hash (#135287)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135287
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-22 23:40:57 +00:00
5942b29850 Disabling amp context when invoking compiler (#138624)
Fix for https://github.com/pytorch/pytorch/issues/133974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138624
Approved by: https://github.com/bdhirsh, https://github.com/drisspg
2024-10-22 23:21:55 +00:00
215999452e [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel
2024-10-22 23:16:05 +00:00
10f16cc7da Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit 8aacbee8e0d6c03096f2ce94b70e2a8fab17ee81.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/wdvr due to this one has failing internal tests, not related to a landrace with #138398 - reverting this one ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2430460176))
2024-10-22 22:53:56 +00:00
39bfba3f56 [sparse] add search for optimal alg_id to torch.compile (#137427)
Summary:

This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`

Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">

* `torch._cslt_sparse_mm_search` has been modified to return optimal
  split-k parameters as well as max alg_id.

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
2024-10-22 22:39:42 +00:00
b4cfb9c014 [EZ] Use at::detail nested namespace in Dispatch.h (#138633)
Instead of `namespace at { namespace detail {`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138633
Approved by: https://github.com/Skylion007
2024-10-22 22:13:21 +00:00
54fbd897d9 [AOTI][refactor] Clean up test_aot_inductor skip list (#138544)
Summary: Remove skips for already fixed tests. Change remaining skip to xfail so that the failure list can be more proactively maintained.

Differential Revision: [D64761257](https://our.internmc.facebook.com/intern/diff/D64761257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138544
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-22 21:32:49 +00:00
a16476b671 Add support for adding extra metadata to chromium events, log to separate columns (#138477)
This diff does a few things:

## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`.
Metadata can only be added to chromium events that have started, but not ended (so, in progress events)
- When you add the data, the metadata is appended to the metadata when you call log_event_end().
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.

## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:

```
__start__
-> dynamo (dynamo compile metrics)
-> entire_frame_compile (compile.inner)
-> backend_compile (i.e. aotdispatch)
-> create_aot_dispatch_function
-> inductor_compile
-> ...
```

BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.

*FAQ: Why can't we use `entire_frame_compile` as the event?*
This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.

## Log metadata as separate columns
(Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.

Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477
Approved by: https://github.com/aorenste
2024-10-22 21:17:44 +00:00
3b2b5486ea Fixes issue with torch._dynamo.assume_constant_result with global functions (#132431)
This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten.
Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable derived from the name of the function.  This causes that function to be overwritten in the global scope.  This PR checks that the name is unique in the global scope as well, avoiding the issue of overriding the function.
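
A minimal sketch of the affected usage pattern (function names here are illustrative):

```python
import torch

@torch._dynamo.assume_constant_result
def get_scale():
    return 2

@torch.compile
def fn(x):
    return x * get_scale()

fn(torch.ones(4))
# Before this fix, the constant result could be stashed in a global whose name was
# derived from "get_scale", clobbering the original function in the global scope.
```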

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431
Approved by: https://github.com/jansel
2024-10-22 21:14:26 +00:00
e3af290165 [export] Add retraceability_non_strict to tests (#138380)
Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability
```

Differential Revision: D64611532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380
Approved by: https://github.com/angelayi
2024-10-22 21:05:51 +00:00
d1be61ce4e Update copyrights to 2024 (#138638)
Spiritual successor of https://github.com/pytorch/pytorch/pull/119413 + CPP docs copyright update as well
Fixes https://github.com/pytorch/pytorch/issues/138630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138638
Approved by: https://github.com/atalman
2024-10-22 21:00:58 +00:00
dbd0a39c79 Bump webrick from 1.7.0 to 1.8.2 in /ios/TestApp (#136593)
Bumps [webrick](https://github.com/ruby/webrick) from 1.7.0 to 1.8.2.
- [Release notes](https://github.com/ruby/webrick/releases)
- [Commits](https://github.com/ruby/webrick/compare/v1.7.0...v1.8.2)

---
updated-dependencies:
- dependency-name: webrick
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-22 13:32:50 -07:00
f089d5ffef Improve input validation for NJT pointwise ops (#138602)
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.

I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
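
A rough sketch of the two cases above (assuming a jagged-layout NJT and that `nt.size(1)` yields the symbolic ragged size; the behavior of the second call reflects this PR's validation):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], layout=torch.jagged
)

out = nt * 2  # fine: dispatches via the mul.Tensor overload

try:
    nt * nt.size(1)  # nested int operand: previously produced garbage, now rejected
except Exception as e:
    print(e)
```
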
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
2024-10-22 20:13:12 +00:00
cyy
1c77b13c06 [6/N] Fix extra warnings brought by clang-tidy-17 (#138572)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138572
Approved by: https://github.com/Skylion007
2024-10-22 19:46:38 +00:00
a71723bf12 [ONNX] Add complex constant support (#138279)
Transform complex Python constants to a float representation as well, like what we already do for tensors.

PS: I don't think it's reasonable to add the "complex->float" handling on the IR side, so I put it here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138279
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-22 19:42:59 +00:00
c7a20939b4 Remove unused enforce_cond_guards_match Dynamo feature flag. (#138589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138589
Approved by: https://github.com/clee2000
2024-10-22 19:36:01 +00:00
078dca1ce8 Aarch64 binary builds - fix passing env_file to Docker (#138588)
Aarch64 builds skipped the logic of sourcing the binary env file. As a result, PYTORCH_EXTRA_INSTALL_REQUIREMENTS passed to Aarch64 builds did not include the triton dependency constraint. This PR makes sure Aarch64 builds follow the same path as our regular manywheel builds.

To work around this issue we had to inject triton into aarch64 builds for release 2.5, which is not ideal: https://github.com/pytorch/builder/pull/2011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138588
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2024-10-22 19:04:19 +00:00
eqy
c0e8458aab [Flex Attention] Don't compute fill order to compute stride order just to get fill order back (#138376)
Was a bit confusing to read when working on #138354

"computer-assisted proof"
```
import random
from typing import List

def argsort(seq):
    # preserve original order for equal strides
    getter = seq.__getitem__
    a_r = range(len(seq))
    return list(reversed(sorted(a_r, key=getter, reverse=True)))  # noqa: C413

def stride_order2fill_order(order):
    """
    Convert stride order to fill order
    For channel last format,

    stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
    """
    lookup = {pos: idx for idx, pos in enumerate(order)}
    fill_order = [lookup[i] for i in range(len(order))]
    return fill_order

def get_stride_order(seq):
    """
    Convert strides to stride order
    """
    sorted_idx: List[int] = argsort(seq)
    out = [0 for _ in range(len(seq))]
    a = sorted_idx.copy()
    for i, elem in enumerate(sorted_idx):
        out[elem] = i
    fillorder = stride_order2fill_order(out)
    assert fillorder == sorted_idx
    return out

for _ in range(1000):
    a = [0, 1, 2, 3]
    random.shuffle(a)
    get_stride_order(a)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376
Approved by: https://github.com/drisspg
2024-10-22 18:40:39 +00:00
2dab4ccb65 [Inductor][ROCm][CK] add CK grouped conv2d fwd kernels to ROCm codegen (#137947)
Plug into lowering and end to end test in a later PR

Instance parsing companion PR https://github.com/ROCm/composable_kernel/pull/1585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137947
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-10-22 18:25:23 +00:00
6e4c19289c [EZ] [BE] Remove (now) unused scale config (#138511)
Final step of moving scale config files to test-infra repo.  Details in https://github.com/pytorch/test-infra/pull/5767

Scale configs are now read from test-infra.  This PR is just cleaning up stale files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138511
Approved by: https://github.com/clee2000
2024-10-22 18:08:42 +00:00
f7e36d8d6f Fix for MSVC problem on Windows Arm64 (#136765)
This PR proposes a workaround for an internal issue introduced in MSVC 14.37 for Windows Arm64 target. It is still an ongoing problem.
The fix will be released with the future versions of Visual Studio 2022 but until then the changes to cpu/vec/vec_base.h should be sufficient.
We also opened a new ticket on Visual Studio Developer Community, it can be found here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136765
Approved by: https://github.com/malfet

Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-10-22 18:07:58 +00:00
fc9093c3d2 Revert "Remove C10_DEPRECATED (#138406)"
This reverts commit 70ec86d7542d461ff6f01ba1a1c9a4f38637af8e.

Reverted https://github.com/pytorch/pytorch/pull/138406 on behalf of https://github.com/wdvr due to failing internal tests - see D64714374 ([comment](https://github.com/pytorch/pytorch/pull/138406#issuecomment-2429912896))
2024-10-22 18:00:41 +00:00
cc93c1e5e4 Upload artifacts during test run (#125799)
Zip and upload artifacts while run_test is running
Upgrade boto3 because I get errors about not having `botocore.vendored.six.move` if I don't
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125799
Approved by: https://github.com/huydhn
2024-10-22 16:48:57 +00:00
2e48788a35 [hierarchical-compilation][invoke_subgraph] Use tracing context to cache artifacts of dispatch keys (#137965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137965
Approved by: https://github.com/zou3519
ghstack dependencies: #137538, #138036
2024-10-22 15:33:42 +00:00
e045e8f0df [hierarchical-compilation][invoke_subgraph] Graph break on input mutation or aliasing (#138036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138036
Approved by: https://github.com/zou3519
ghstack dependencies: #137538
2024-10-22 15:33:42 +00:00
4dd4d38ca9 [hierarchical-compilation][hop] Introduce invoke_subgraph (#137538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137538
Approved by: https://github.com/zou3519
2024-10-22 15:33:34 +00:00
046f02d2de [ROCm] index_put performance improvement (#138259)
On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel.  None of the existing unit tests were exercising the large index case so a new unit test was added.

It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy
2024-10-22 15:21:43 +00:00
2827befe61 [AOTI][reland] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138541)
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.

Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. Also, a code check is added to regex-match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't go unnoticed.

Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei
2024-10-22 14:17:27 +00:00
bb8bc7d6b3 config: simplify most of the config handling and fix some bugs (#138377)
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.

Cleanups

- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.

For release notes, Not sure exactly how to communicate this, but something like
"ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
2024-10-22 13:40:26 +00:00
1b61313acd Add type stub for SymInt.rsub (#138543)
Fixes https://github.com/pytorch/pytorch/issues/138478

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138543
Approved by: https://github.com/malfet
2024-10-22 13:27:32 +00:00
8c840fb921 Add out_dtype kw argument to optimize_bsr_dense_addmm (#136626)
As in the title.

Addresses the task in https://github.com/pytorch/ao/pull/821#issuecomment-2373290266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136626
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-10-22 09:52:25 +00:00
5a13282c75 [compiled autograd] tls access helpers (#138061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138061
Approved by: https://github.com/yf225
ghstack dependencies: #137953, #137821
2024-10-22 08:03:52 +00:00
49fa437097 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreaded doesn't work yet, this adds python side TLS only for the python side state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-22 08:03:52 +00:00
75259145ec [compiled autograd] directly use python Logger class in cpp (#137953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953
Approved by: https://github.com/jansel, https://github.com/yf225
2024-10-22 08:03:52 +00:00
60c1433041 [aoti] Cond symint input support (#138373)
If the input is a symint, we don't want to add the aoti_torch_assign_tensors_out

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138373
Approved by: https://github.com/larryliu0820, https://github.com/desertfire
2024-10-22 07:53:22 +00:00
51045e6251 make DimHints compatible with Dims (#138490)
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`; this PR stops that and uses `Dim()` objects to guide DimHints.

The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.

Current behavior is that:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x - y

inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
    "x": (Dim.AUTO, Dim("d1", min=3)),
    "y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```

The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.

Differential Revision: D64636101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
2024-10-22 07:43:48 +00:00
9a9a0abc28 [SDPA-CUDNN] Make CuDNN Attention Opt in (#138522)
# Summary
Currently we have a `cudnn_order` that says on H100 w/ new enough CuDNN backend (we ship a 9.1 version in OSS) try to run CuDNN attention first. We have already encountered a few bugs with the release of 2.5:

1. https://github.com/pytorch/pytorch/issues/138529
2. https://github.com/huggingface/diffusers/issues/9704
3. https://github.com/pytorch/pytorch/pull/138354

In light of the above we are going to make the CuDNN backend Opt-in by default.

This can be done easily with the context manager for choosing backends I.e.:
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# toy inputs so the snippet is self-contained (requires a CUDA device)
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```

This PR puts the CuDNN backend as the lowest precedence in the backend list, meaning that the Math backend will always be chosen unless disabled (which is done via the context manager).

Cc @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138522
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet
2024-10-22 07:23:06 +00:00
2b4af6fa74 Mark torch.get_device as overridable at the python level (#132706)
Summary:
- add a value to `get_testing_overrides` function for `torch.get_device()`
- remove `torch.get_device()` from the `get_ignored_functions` list

Test Plan:
Existing override testing infra, which should pick up the updates to these two variables.

Closes the loop on:
https://github.com/pytorch/pytorch/pull/132567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132706
Approved by: https://github.com/ezyang
2024-10-22 07:20:42 +00:00
84e5f34fd1 bug in unbacked_bindings for a*u0 (#138136)
Summary: we were storing a*u0 instead of u0 in unbacked_bindings / unbacked_var_to_val

Test Plan: -

Differential Revision: D64508936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138136
Approved by: https://github.com/ezyang
2024-10-22 07:04:30 +00:00
a80b87353c [pt2] Log is_forward field to dynamo_compile scuba table (#138505)
Differential Revision: [D64711721](https://our.internmc.facebook.com/intern/diff/D64711721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138505
Approved by: https://github.com/oulgen
2024-10-22 05:50:49 +00:00
0b4a071a1d [CP] Implement AllGather based context parallelism (#132820)
Summary:

This implementation does not exploit the fact that after the allgather we could perform SDPA directly without the ring-based SDPA, but we can overlap the communication with the first sharded kv computation. This implementation shows some performance benefit and memory savings compared to the original alltoall implementation in certain cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820
Approved by: https://github.com/XilunWu
2024-10-22 05:25:50 +00:00
6b29d40e9b [PGNCCL] Add default value for nccl_nonblocking_timeout (#138374)
- Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1).
- Reuse C10D_CHECK_TIMEOUT in other CHECK macros

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138374
Approved by: https://github.com/eqy
ghstack dependencies: #137855, #138488
2024-10-22 05:06:18 +00:00
03c72976a5 Properly uses ref-counting for torch.cuda.use_mem_pool (#133600)
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.

The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold an extra reference as well.
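
A rough sketch of the user-facing flow (assuming the `torch.cuda.MemPool` / `torch.cuda.use_mem_pool` APIs from this stack and a CUDA device):

```python
import torch

pool = torch.cuda.MemPool()            # the MemPool object itself holds a reference to the pool
with torch.cuda.use_mem_pool(pool):    # the context manager takes another reference
    x = torch.empty(1024, device="cuda")
# Exiting the context drops its reference; the underlying pool goes away only once
# `pool` itself (and its outstanding allocations) are released as well.
```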

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-10-22 03:21:53 +00:00
89067402d4 [easy] in ROCmTemplate set kwargs when creating Buffer (#138521)
Summary: https://github.com/pytorch/pytorch/pull/137768 makes Inductor IR kw only

Test Plan: CI

Differential Revision: D64723804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138521
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-22 03:13:16 +00:00
cyy
f881094366 Use Wmissing-prototypes on torch_cuda (#136080)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136080
Approved by: https://github.com/ezyang
2024-10-22 02:04:19 +00:00
9f7c26bef3 Fix training IR bug by changing passes order (#138292)
Inserting runtime assertions causes the gm to have different names, but the graph signature was populated earlier. To avoid this kind of error in the future, I refactored these steps into a helper function.

Differential Revision: [D64576251](https://our.internmc.facebook.com/intern/diff/D64576251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138292
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #138266
2024-10-22 01:24:14 +00:00
012ff2a0aa Don't try to load cufile (#138501)
Trying to loading it caused a big issue with 2.5.0 release - https://github.com/pytorch/pytorch/issues/138324

cufile is not actually used currently by default, see #133489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138501
Approved by: https://github.com/atalman, https://github.com/mikaylagawarecki, https://github.com/malfet
2024-10-22 01:13:27 +00:00
5adc33d3b8 Training IR should preserve custom metadata (#138266)
Differential Revision: [D64576252](https://our.internmc.facebook.com/intern/diff/D64576252)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138266
Approved by: https://github.com/yushangdi
2024-10-22 01:09:56 +00:00
0a38c0ec89 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion gets kicked in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings.

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-22 00:50:00 +00:00
3b186c5659 Revert "[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303)"
This reverts commit 1417b2cd0562e0e4d4349024ef7c731b99214890.

Reverted https://github.com/pytorch/pytorch/pull/138303 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138303#issuecomment-2427991065))
2024-10-22 00:46:48 +00:00
d7e0e1dbc4 [DeviceMesh] Use split_group to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129)
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.
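
A rough sketch of the eager-init path this applies to (assumes a multi-GPU `torchrun` launch with 8 ranks; the names and mesh shape are illustrative):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

local_rank = int(os.environ["LOCAL_RANK"])
# Passing device_id eagerly initializes the default NCCL communicator...
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
# ...so the sub-groups for the mesh dimensions can be created via split_group()
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
```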

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
2024-10-22 00:00:05 +00:00
a7f49de485 Fixes issue with enums in a tuple for dynamo (#133123)
Currently, when tuple values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside the tuple cannot be properly encoded this way. In particular, if an enum is contained inside a tuple, invalid Python code is generated.
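
A small example of why `repr` breaks down here (plain Python, independent of dynamo):

```python
import enum

class Color(enum.Enum):
    RED = 1

print(repr((1, "a")))        # (1, 'a')             -> valid Python source
print(repr((Color.RED, 2)))  # (<Color.RED: 1>, 2)  -> not valid Python source
```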

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
2024-10-21 23:45:11 +00:00
e24871eb3c Add environment variable to force no weights_only load (#138225)
In preparation for `weights_only` flip, if users don't have access to the `torch.load` call

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138225
Approved by: https://github.com/albanD
2024-10-21 23:26:15 +00:00
ec4ce094b2 [Traceable FSDP2][CI] Skip more tests on rocm (#138497)
Some of the test checks doesn't work well with rocm.

Fixes https://github.com/pytorch/pytorch/issues/138409.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138497
Approved by: https://github.com/fduwjj
2024-10-21 23:11:01 +00:00
77868697b7 [inductor][subgraph] Add size asserts (#138424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138424
Approved by: https://github.com/eellison
ghstack dependencies: #137555
2024-10-21 22:43:49 +00:00
853da168fc [AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix it here.

Test Plan: tbd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-21 21:45:13 +00:00
20a2d39557 Log all failing test repros to scuba (#138394)
This has the benefit that

1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba
2) We can do analysis (e.g. set-difference two sets of tests across two PRs)
3) We can get results faster at the test-level granularity instead of job-level granularity we see in the HUD/GH.

I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9

I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
2024-10-21 21:35:47 +00:00
ef52bbbf23 More appropriate socket errors and debug messages (#130347)
Fixes #128998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130347
Approved by: https://github.com/fduwjj
2024-10-21 21:28:40 +00:00
364340c7ee [Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR (#138488)
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
  508 |     int split_color{NCCL_SPLIT_NOCOLOR - 1};
      |                     ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
2024-10-21 21:14:20 +00:00
134f6cda7e Support record_stream() for NJT (#137099)
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT so that they all live until stream computation is complete.

This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
2024-10-21 21:10:42 +00:00
70ec86d754 Remove C10_DEPRECATED (#138406)
Looking in the code I see
```
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
```
But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_.

Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`.

Therefore, since we are finished deprecating old MSVCs we can deprecate `C10_DEPRECATED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-21 20:57:27 +00:00
bb2e090b7d [user triton] typing triton_kernel_wrap.py (#138230)
Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fixed all the mypy errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-10-21 20:34:49 +00:00
60081c29ec Use cuda 12.4 pytorch_extra_install_requirements as default (#138458)
Since cuda 12.4 binaries are default binaries on pypi now. The pytorch_extra_install_requirements need to use 12.4.
This would need to be cherry-picked to release 2.5 branch to avoid injecting these versions into metadata during pypi promotion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
2024-10-21 20:16:37 +00:00
c1ead6fba3 Bugfix for passing None args to user defined Triton kernel (#138472)
add test

fewer failing tests

more tests passing

tests passing

lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138472
Approved by: https://github.com/aakhundov
2024-10-21 20:00:04 +00:00
8ad191ae21 [dynamo] Replace __str__ with __repr__ in some places (#136316)
## The problem

In a typical debugger, `repr()` is used to display variables and not `str()`.

Several classes in Dynamo have a `__str__()` method that returns useful information and a  `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore.

`str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()](https://docs.python.org/3/library/functions.html#repr)".

## The solution

In the Python object model, if there is no `__str__` method, `__repr__`  is used instead (but not the other way around).

So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier.
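
A small illustration of the fallback behavior (plain Python):

```python
class OnlyStr:
    def __str__(self):
        return "useful debugging info"

class OnlyRepr:
    def __repr__(self):
        return "useful debugging info"

print(repr(OnlyStr()))   # <__main__.OnlyStr object at 0x...>  (what a debugger shows)
print(str(OnlyStr()))    # useful debugging info
print(repr(OnlyRepr()))  # useful debugging info
print(str(OnlyRepr()))   # useful debugging info (str() falls back to __repr__)
```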

The specific classes changed were all in `torch._dynamo.variables`:

* `builtin.BuiltinVariable`
* `constant.ConstantVariable`
* `constant.EnumVariable`
* `functions.UserMethodVariable`
* `lazy.LazyVariableTracker`
* `lazy.LazySymNodeFormatString`
* `misc.GetAttrVariable`
* `misc.NullVariable`
* `user_defined.UserDefinedObjectVariable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136316
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
2024-10-21 19:50:38 +00:00
41f7d01ccf Increase Docker push timeout limit from 15 to 30m (#138487)
Some images now take more than 15 to finish pushing and keep timing out, for example, https://github.com/pytorch/pytorch/actions/runs/11442231435/job/31832143440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138487
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/ZainRizvi
2024-10-21 19:44:52 +00:00
32d4582e02 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit 16caa8c1b3a02e47b5f52d3c2d40d7931cc427dc.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to checking if this will solve inductor errors ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2427565425))
2024-10-21 19:40:58 +00:00
ff2f751bfb [tools] fix nightly pull tool when the conda environment not exists (#138448)
Now, `conda env remove --name env` exits with errors if the given environment does not exist. This PR checks the existence of the environment before trying to remove it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138448
Approved by: https://github.com/ezyang
2024-10-21 19:35:48 +00:00
071f6f2de8 Revert "[ROCm] Fix ADDMM hipBLASLt regression (#138267)"
This reverts commit 14a3e12985e4550440a8a1755d3418e9b02b4950.

Reverted https://github.com/pytorch/pytorch/pull/138267 on behalf of https://github.com/jeffdaily due to this PR went to far when partially reverting #137604; the env var default should be the same on ROCm and CUDA ([comment](https://github.com/pytorch/pytorch/pull/138267#issuecomment-2427550465))
2024-10-21 19:33:13 +00:00
abbd71d29d [BE][Easy] enable PYFMT for torch.fx (#138443)
Reproduce command:

```bash
ghstack checkout https://github.com/pytorch/pytorch/pull/138443
git checkout HEAD~1 torch/
lintrunner -a --take "PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138443
Approved by: https://github.com/ezyang
2024-10-21 19:15:49 +00:00
8231180147 [dynamo][refactor] Refactor Wrap HOP to reuse it for invoke_subgraph (#137555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137555
Approved by: https://github.com/zou3519
2024-10-21 18:26:29 +00:00
c6609ece84 [ONNX] Remove deprecated export_to_pretty_string (#137790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
ghstack dependencies: #137789
2024-10-21 18:17:48 +00:00
07cc4bd3e2 typing compile_fx.py (#138033)
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
2024-10-21 18:14:59 +00:00
81738403a2 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-21 17:52:21 +00:00
6e38c87ad0 [ONNX] Remove ExportTypes (#137789)
Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward.

Differential Revision: [D64412947](https://our.internmc.facebook.com/intern/diff/D64412947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-21 17:50:28 +00:00
af0bc75460 Remove deprecated alias macro(1/3) (#137556)
**Detailed Descriptions:**
- Remove AT_ERROR Macro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137556
Approved by: https://github.com/ezyang
2024-10-21 17:32:32 +00:00
16caa8c1b3 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-10-21 17:20:06 +00:00
9bb327bfc6 Revert "[AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)"
This reverts commit a8b912f39d36bd2e6d204808d866439d0075f1a5.

Reverted https://github.com/pytorch/pytorch/pull/137785 on behalf of https://github.com/ezyang due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/137785#issuecomment-2427295668))
2024-10-21 17:18:56 +00:00
02dd3b8e32 [dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)
This method was no longer needed after #113725; the checking logic is
now in `SideEffects.check_allowed_side_effect`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #137905
2024-10-21 16:43:34 +00:00
1032ce6bd3 Only upload test/test-reports as artifacts (#138019)
Fixes https://github.com/pytorch/pytorch/issues/137851

This is possibly too restrictive, but I spot-checked and I don't think any of the files outside of test/test-reports are important. I can't guarantee, however, that someone wasn't putting something elsewhere and expecting it to still be zipped.

Outputs can be seen on HUD by clicking "show artifacts".
Some examples:
Logs
<img width="293" alt="image" src="https://github.com/user-attachments/assets/9a2db9b1-0f62-4209-909b-4f56a908619d">

XMLs
<img width="234" alt="image" src="https://github.com/user-attachments/assets/a639fe38-a112-4ea5-abba-ad1d5b25bb43">

JSONs
<img width="180" alt="image" src="https://github.com/user-attachments/assets/be7a49ac-5258-4bc5-981d-3f134ebd343d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138019
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2024-10-21 16:43:30 +00:00
0a4197490c Delay mul/pow expansion for _SympyT to enable more folding (#138235)
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```
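The effect can be reproduced directly in sympy (a standalone illustration, not the ShapeEnv code):

```python
import sympy

a, b = sympy.symbols("a b", positive=True)

# Product form folds automatically: (a + b)**2 / (a + b) -> a + b
print((a + b) ** 2 / (a + b))                  # a + b

# The eagerly expanded numerator does not fold on its own ...
expanded = sympy.expand((a + b) ** 2) / (a + b)
print(expanded)                                # (a**2 + 2*a*b + b**2)/(a + b)

# ... and only simplifies after an explicit cancel pass
print(sympy.cancel(expanded))                  # a + b
```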

Fixes #136044.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
2024-10-21 16:38:47 +00:00
701ddf962a [inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.

This also adds metadata to register_replacement patterns, including pad_mm.

This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
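Conceptually, the transfer looks something like the following (a minimal sketch with hypothetical names, not the actual pattern-matcher code):

```python
# Minimal sketch: copy `original_aten` metadata onto replacement FX nodes.
# `original_nodes` / `replacement_nodes` are hypothetical names.
def transfer_metadata(original_nodes, replacement_nodes):
    if len(original_nodes) != 1:
        # Ambiguous provenance (e.g. add(z, mm(x, y))): skip the transfer.
        return
    (src,) = original_nodes
    if "original_aten" not in src.meta:
        return
    for node in replacement_nodes:
        node.meta.setdefault("original_aten", src.meta["original_aten"])
```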

Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
2024-10-21 16:33:12 +00:00
279ddfc6ee Add type check for dilation in torch.quantized_max_pool3d() (#137845)
Fixes #136716

repro:

```python
import torch

input = torch.randn([1, 1, 1, 1, 1])
input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32)
torch.quantized_max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1)) # crash

input = torch.randn([1, 1, 1, 1, 1])
input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32)
result = torch.nn.functional.max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1))  # crash
```

result:

```
RuntimeError: Expected dilation >= 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137845
Approved by: https://github.com/albanD
2024-10-21 16:15:57 +00:00
a8b912f39d [AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we fix it here as well.

Test Plan: tbd

Differential Revision: D64246495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
2024-10-21 15:30:07 +00:00
cyy
7ec21a6f0f Enable clang-tidy on torch/csrc/api (#138437)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138437
Approved by: https://github.com/r-barnes
2024-10-21 14:22:38 +00:00
8aacbee8e0 Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #138323
2024-10-21 13:51:54 +00:00
649f8117ad Add deprecated warning for lazyInitXXX API (#138323)
Detailed Descriptions:
Involved APIs are as followed:
- ``lazyInitCUDA``
- ``lazyInitHIP``
- ``lazyInitXPU``
- ``lazyInitMTIA``
- ``lazyInitPrivateUse1``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138323
Approved by: https://github.com/malfet
2024-10-21 13:51:54 +00:00
1417b2cd05 [AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303)
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole.

Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303
Approved by: https://github.com/hl475
2024-10-21 13:47:50 +00:00
8f3efb8797 Update slow tests (#133203)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weeekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133203
Approved by: https://github.com/pytorchbot
2024-10-21 12:00:52 +00:00
cyy
14fc6b70ea Remove torch/csrc/api/include/torch/linalg.h (#138435)
Only one place in OSS uses it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138435
Approved by: https://github.com/r-barnes
2024-10-21 07:04:27 +00:00
5f940a44af [AMD] Fix torch ck backend build with 6.2.1 (#138434)
Summary: It's complaining about missing __hip_bfloat162 definition w/o this header.

Differential Revision: D64673284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138434
Approved by: https://github.com/yaoyj11, https://github.com/houseroad
2024-10-21 06:38:38 +00:00
362ca54f03 [c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y
```
where the collective is issued in eager mode (with `async_op=True`) but waited on in the compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager mode for the SparseArch all_to_all but want to wait for them in the compiled region at the beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_work_registry`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_work_registry`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D64511994](https://our.internmc.facebook.com/intern/diff/D64511994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-21 06:02:57 +00:00
cyy
a170ff4167 Prepare to enable ASAN on CUDA (#138404)
See which tests fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138404
Approved by: https://github.com/ezyang
2024-10-21 03:55:29 +00:00
9ad2736627 Remove extraneous C++14 comment (#138408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138408
Approved by: https://github.com/Skylion007
2024-10-21 03:54:41 +00:00
6987bfb40a Revert "[dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)"
This reverts commit 3c7d9d6c7fa565e811675be7dd84e5ef7c8ba7a0.

Reverted https://github.com/pytorch/pytorch/pull/137906 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137906#issuecomment-2425505452))
2024-10-21 03:42:38 +00:00
fb0da32377 [DeviceMesh] Small refactor to optimize DeviceMesh subgroup creation (#138117)
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves their creation from the inner loop of the subgroup creation (over the subgroup ranks of each mesh dimension) to the outer loop (over each mesh dimension).

For example, given we have a 2 * 4 DeviceMesh, we are re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, which is 2 times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
2024-10-21 03:04:24 +00:00
cyy
a05b64a38f [5/N] Fix extra warnings brought by clang-tidy-17 (#138403)
Follows #137983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138403
Approved by: https://github.com/ezyang
2024-10-21 02:59:54 +00:00
cyy
82eb09aafd [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-21 02:58:59 +00:00
2d3455e7d9 [c10d] try fix the unstableness of test_get_future_result (#138415)
Summary:
It seems that, depending on the platform, either an NCCL error or a timeout would be
raised first on rank 0. Now we try to force the timeout by not exiting the other ranks.
Test Plan:
Tests pass locally

Tags:

Fixes https://github.com/pytorch/pytorch/issues/138397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138415
Approved by: https://github.com/kwen2501
2024-10-21 01:17:30 +00:00
cyy
e7b8a9a4c1 [5/N] Fix clang-tidy warnings in torch/csrc/api/ (#138389)
Follows #138382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138389
Approved by: https://github.com/ezyang
2024-10-21 01:12:37 +00:00
e4ad02892f Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-20 23:48:54 +00:00
4f45a052ad Fix try_solve for s1*s2 == 0 when both symbols are unknown (#137919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137919
Approved by: https://github.com/ezyang
2024-10-20 23:33:08 +00:00
09cf163ae3 Fix for mixed_mm tests failures on SM70 and lower (#138183)
This PR fixes mixed_mm tests that are failing on SM70 and lower as discussed here https://github.com/pytorch/pytorch/pull/123762#issuecomment-2406601729.

The failure occurs because some of the mixed_mm tests expect triton code to be generated, but on SM70 and lower, the generation of triton code is skipped (see https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L693). These tests will now be skipped when running on SM70 and lower. I do not have access to an SM70 GPU, so I was not able to test these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138183
Approved by: https://github.com/ezyang
2024-10-20 21:14:31 +00:00
a1899b5a9e Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 239ad73cb1c8a91f0a2de21d27af3d98f5a8dddc.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/yf225 due to Sorry for reverting your PR but I believe this PR breaks the binary builds. Example: https://ossci-raw-job-status.s3.amazonaws.com/log/31790258895, with error message: `getenv is not a member of c10::utils`, might be easier to search for `not a member of` in the log ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2425192780))
2024-10-20 19:48:14 +00:00
a9f4f89cd5 [CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-20 19:38:18 +00:00
cyy
239ad73cb1 [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-20 13:05:04 +00:00
07fd61e106 [SDPA] Fix warning message (#138278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138278
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-10-20 08:00:56 +00:00
f568d48890 Enable git long paths checkout on Windows (#138411)
Checking out PyTorch on Windows started to fail after ROCm change https://github.com/pytorch/pytorch/pull/131004, in which one of the submodule paths, `third_party/composable_kernel`, became too long: https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376

According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround.

### Testing

Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411
Approved by: https://github.com/wdvr
2024-10-20 07:18:44 +00:00
f8303740f7 Revert "Enable git long paths checkout on Windows (#138411)"
This reverts commit 12283035f8c08cd3487bfaac25ccef7da90952ba.

Reverted https://github.com/pytorch/pytorch/pull/138411 on behalf of https://github.com/huydhn due to Opps, I forgot Windows binary build, let me revert and reland this one ([comment](https://github.com/pytorch/pytorch/pull/138411#issuecomment-2424661640))
2024-10-20 06:50:48 +00:00
12283035f8 Enable git long paths checkout on Windows (#138411)
Checking out PyTorch on Windows started to fail after ROCm change https://github.com/pytorch/pytorch/pull/131004, in which one of the submodule paths, `third_party/composable_kernel`, became too long: https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376

According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround.

### Testing

Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411
Approved by: https://github.com/wdvr
2024-10-20 06:32:34 +00:00
d1027c2be6 Revert "Update sympy version constraint to 1.13.3 (#138338)"
This reverts commit d8279ad9d162b5ce71699f462d3664c3745b14f5.

Reverted https://github.com/pytorch/pytorch/pull/138338 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think a bunch of inductor tests and test_dynamic_shapes are failing in trunk after this lands d8279ad9d1 ([comment](https://github.com/pytorch/pytorch/pull/138338#issuecomment-2424487225))
2024-10-20 03:19:02 +00:00
3f3b692a00 [ROCm] CK-based GEMM (#131004)
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()`
- reference CK gemm implementations for float, bfloat16, and half types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony

Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-10-20 02:57:43 +00:00
0a2407b93c [dynamo] Support omegaconf DictConfig (#138378)
Fixes https://github.com/pytorch/pytorch/issues/138224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138378
Approved by: https://github.com/jansel
ghstack dependencies: #138359
2024-10-20 02:43:17 +00:00
f892543c1f [dynamo] Support TypedDict (#138359)
Seen in vLLM.

Fixes https://github.com/pytorch/pytorch/issues/132629
Fixes https://github.com/pytorch/pytorch/issues/133613
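A minimal sketch of the kind of pattern this enables (class and function are hypothetical, loosely modeled on vLLM-style configs):

```python
from typing import TypedDict

import torch

class SamplingParams(TypedDict):
    temperature: float
    top_k: int

@torch.compile(fullgraph=True)
def scale_logits(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # Constructing and indexing a TypedDict inside the compiled region
    # previously was not traceable by dynamo (see the linked issues).
    params = SamplingParams(temperature=temperature, top_k=50)
    return logits / params["temperature"]

out = scale_logits(torch.randn(4, 32), 0.7)
```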

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138359
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-20 02:43:17 +00:00
cyy
1f349eed61 [4/N] Fix extra warnings brought by clang-tidy-17 (#137983)
Follows #137552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137983
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-10-20 01:02:33 +00:00
b1b7c714ed Add deprecated C10_UNUSED and C10_NODISCARD macros back (#138398)
For backwards compatibility. Disallow internal use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138398
Approved by: https://github.com/malfet
2024-10-20 00:21:19 +00:00
d8279ad9d1 Update sympy version constraint to 1.13.3 (#138338)
`sympy` was pinned to version 1.13.1 due to test failures with version 1.13.2 on Windows and mac, as reported in https://github.com/pytorch/pytorch/pull/133235. Now that a newer version, 1.13.3, has been released, this PR aims to verify whether the test failure has been resolved and also to allow building with newer versions for packaging purposes (e.g., https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277#discussion_r1806721862).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138338
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-20 00:20:02 +00:00
14a3e12985 [ROCm] Fix ADDMM hipBLASLt regression (#138267)
Fixes #138067

A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604

The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267
Approved by: https://github.com/malfet
2024-10-20 00:19:10 +00:00
47e80abc7a Revert "[inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)"
This reverts commit fb44658415e50b5be6a187ff3f14243c0fdf3daf.

Reverted https://github.com/pytorch/pytorch/pull/138089 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_original_aten_preserved_pad_mm test runs OOM in trunk fb44658415 ([comment](https://github.com/pytorch/pytorch/pull/138089#issuecomment-2424297269))
2024-10-19 23:55:01 +00:00
fcedf93d1e [Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)
After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR.
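The caching pattern is roughly the following (a minimal sketch with hypothetical names, not the actual FSDP2 code):

```python
# Module-level cache, written outside of any compiled region.
_compiled_autograd_enabled: bool = False

def set_compiled_autograd_enabled(enabled: bool) -> None:
    global _compiled_autograd_enabled
    _compiled_autograd_enabled = enabled

def compiled_autograd_enabled() -> bool:
    # Safe to call under Dynamo tracing: it only reads a plain Python bool
    # instead of querying Compiled Autograd's runtime state getter.
    return _compiled_autograd_enabled
```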

Fixes https://github.com/pytorch/pytorch/issues/138177.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187
Approved by: https://github.com/xmfan, https://github.com/awgu
2024-10-19 19:10:31 +00:00
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
fb44658415 [inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.

This also adds metadata to register_replacement patterns, including pad_mm.

This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.

Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
2024-10-19 16:37:08 +00:00
38ea487338 Re-raise in _run_sympy_handler to reduce log spew (#138356)
Fixes: https://github.com/pytorch/pytorch/issues/138069

I tested this by running `python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu` before and after the change and verifying no more log spew.

I'm uncertain on if it makes sense to add a test for this PR. Question for reviewers: is there a standard paradigm for testing these log spew based fixed? Happy to add a test if someone can point me towards the right direction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138356
Approved by: https://github.com/ezyang
2024-10-19 16:02:45 +00:00
c0879d0c21 Fix lint
Regression caused by fddabc6e0b that was force merged
2024-10-19 08:33:41 -07:00
cyy
cdc9f14227 [4/N] Fix clang-tidy warnings in torch/csrc/api/ (#138382)
Follows #138328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138382
Approved by: https://github.com/ezyang
2024-10-19 13:32:51 +00:00
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
cyy
2f6a70bfea Enable more UBSAN checks (#138288)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138288
Approved by: https://github.com/ezyang
2024-10-19 13:00:26 +00:00
cyy
675e16e137 [3/N] Fix clang-tidy warnings in torch/csrc/api/ (#138328)
Follows #136998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138328
Approved by: https://github.com/ezyang
2024-10-19 07:07:39 +00:00
795255a7c8 Revert "[Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)"
This reverts commit 0c913b35aaea9ca33510239e939957ec5fe66d78.

Reverted https://github.com/pytorch/pytorch/pull/138187 on behalf of https://github.com/yf225 due to linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu) test_compiled_autograd_ctx failed ([comment](https://github.com/pytorch/pytorch/pull/138187#issuecomment-2423609108))
2024-10-19 06:12:47 +00:00
de16159e56 [MPS] Fix sliced cast (#138314)
This fixes an internal crash due to an invalid buffer size computation when the sliced API is used

Not sure what was the purpose of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
  baseShape = src._base().sizes();
} else {
  baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
  flattenedShaped *= baseShape[i];
}
```
as `flattenedShaped` could be computed much more easily as `[srcBuf length] / src.element_size()`, and even if `srcBuf` is padded this is a safe thing to do.

When someone allocated a buffer to hold, say, uint8 and then view-casted it to float16, the attempt to compute `baseShape` returned the sizes of the original tensor in its original data type, rather than the size in the new dtype.

Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
2024-10-19 05:17:09 +00:00
0c913b35aa [Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)
After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR.

Fixes https://github.com/pytorch/pytorch/issues/138177.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187
Approved by: https://github.com/xmfan, https://github.com/awgu
ghstack dependencies: #138245, #138174
2024-10-19 04:33:35 +00:00
8f118e53d7 [CI] Fix CompiledDDP failure when the gradient is not contiguous; Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138174)
Summary:
As title

`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174
Approved by: https://github.com/yf225, https://github.com/kwen2501
ghstack dependencies: #138245

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-19 04:33:35 +00:00
3cfd244495 Add USE_SYSTEM_NVTX option (#138287)
## Summary

We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.

This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).

## Test Plan

The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
2024-10-19 04:26:01 +00:00
a20a17fd6f [Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Fixes https://github.com/pytorch/pytorch/issues/114369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-19 04:12:45 +00:00
88eb15a3e3 [audio hash update] update the pinned audio hash (#138139)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138139
Approved by: https://github.com/pytorchbot
2024-10-19 04:02:21 +00:00
7d076b9e3a updated EC2 fetching of metadata to use IMDSv2 (#138286) 2024-10-18 20:58:47 -07:00
ac7f52b301 Revert "[inductor] add a threshold for membw saving during fusion (#136782)"
This reverts commit 6647320de2077c10309f5025a007d51c7fb542d8.

Reverted https://github.com/pytorch/pytorch/pull/136782 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_memory starts to fail after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/136782#issuecomment-2423549196))
2024-10-19 03:43:42 +00:00
fecd370ea1 [c10d] Fix color value for comm split being negative (#137855)
Fixes https://github.com/pytorch/pytorch/issues/137856.

### Issue 1
Today under `ProcessGroupNCCL::Options`, color is declared as:
```
    int64_t split_color{0};
```
When passing this variable to `ncclCommSplit` which accepts `int`, the value may overflow and become negative, as in #137856. But NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).

But that's not all.

### Issue 2
`split_color` is pybind'ed to python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]:     1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because a Python `int` represents a wider range than a C++ `int`, so we cannot pass hash values -- which are potentially big ints -- from Python to C++. The PR takes the hash value modulo `c_int`'s max value.
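The clamping idea, sketched in Python (the function name is hypothetical; the real logic lives on the C++/pybind boundary):

```python
_C_INT_MAX = 2**31 - 1  # upper bound of a C `int`

def to_nccl_split_color(group_hash: int) -> int:
    # Python ints are unbounded, but ncclCommSplit only accepts a
    # non-negative C `int`, so fold the hash into that range.
    return abs(group_hash) % _C_INT_MAX

print(to_nccl_split_color(hash("pg:tp:ranks=0-7")))
```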

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855
Approved by: https://github.com/wconstab
2024-10-19 03:17:19 +00:00
542f7c8383 Eliminate C10_NODISCARD (#138336)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138336
Approved by: https://github.com/Skylion007
2024-10-19 02:54:06 +00:00
a4b6ef178c [c10d] Reorder cpp stack dump and FR dump and add log prefix to loggings (#138368)
The rationale behind this PR is to:
1. Move the dump of C++ traces after the FR dump, because the FR dump is timed (meaning it will not block forever) while the dumping of C++ traces is likely to block, so we swap the order. Ideally we also want to make the C++ stacktrace dump a future wait; if we want to go down this path, we can make it happen in another PR.
2. Add a log prefix to the logs that do not have one yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138368
Approved by: https://github.com/c-p-i-o
2024-10-19 02:43:41 +00:00
ea412d5554 [AOTI] Fix a special case compile time data type codegen for sym int variables (#138106)
Summary:
This change unblocks the CFR AOTI lowering runtime error.

TL;DR:

In this model, one Triton kernel expects a scalar input dtype of i64 but gets an i32. The reason is that `auto` can infer a smaller data type if the variable passed in is, e.g., i32, thus causing a CUDA IMA.
 Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`.

This diff manually casts all symbolic arguments to i64 at compile time for i64 Triton kernel inputs, instead of using `auto var_x = {arg}` in the cpp wrapper code.

Test Plan:
Verified in FLB locally:

```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16  --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"```

Differential Revision: D64490039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138106
Approved by: https://github.com/ColinPeppler
2024-10-19 02:30:53 +00:00
d5035f0aab fix codecache write_atomic path issue on Windows. (#138331)
Fixes #138211

The `Path.rename` function has Windows-specific behavior: it raises `FileExistsError` when the target file already exists.
This does not happen on Linux, so I wrote a small repro to figure out what was going on.

After stepping through the repro code:
```python
import os
import shutil  # needed for the copy2 fallback below
import sys
from pathlib import Path

_IS_WINDOWS = sys.platform == "win32"

def test_case():
    cwd = os.getcwd()
    path1 = os.path.join(cwd, "haha1.txt")
    path2 = Path(os.path.join(cwd, "haha2.txt"))

    try:
        path2.rename(path1)
    except FileExistsError as e_file_exist:
        if _IS_WINDOWS:
            # on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename
            shutil.copy2(path2, path1)
            os.remove(path2)
        else:
            raise e_file_exist
    except BaseException as e:
        raise e

    print("run here.")

if __name__ == "__main__":
    test_case()
```
We found that the code `path2.rename(path1)` can be broken down into:
1. copy file2's content to file1.
2. delete file2.

So, we can implement equivalent code on the Windows path:
```python
shutil.copy2(src=tmp_path, dst=path)
os.remove(tmp_path)
```

This is how we arrived at the current PR.

TODO: need cherry-pick to release/2.5 branch, CC: @atalman .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-19 01:27:12 +00:00
949b6f685d Enable -Werror on s390x (#136527)
Enable -Werror on s390x

Example of original issue on s390x:
https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704

Most of the warnings are not specific to s390x, but rather to gcc-13 or gcc-14. To test this on s390x, an image with gcc-13 is needed. For s390x, new regressions are tested on every merge via the trunk workflow.

`-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe.

`-Wredundant-move` with new gcc complains about `std::move(...)` disabling copy elision, but removing `std::move(...)` makes the clang versions in use complain about copying objects when they could be moved. For now, also disable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527
Approved by: https://github.com/malfet
2024-10-19 01:18:42 +00:00
4a3c9400fe Update cpuinfo submodule (#138351)
To suppress an error on ARM systems where PR_SVE_GET_VL is missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138351
Approved by: https://github.com/Skylion007
2024-10-19 01:12:29 +00:00
ff598f2f4d [DTensorTestbase] Add an optional eager_init flag to with_comms() to support eager init nccl communicator for DeviceMesh test case (#138108)
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
The default for `eager_init` is `False`.
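A minimal sketch of what the flag changes (harness and function names are hypothetical):

```python
import torch
import torch.distributed as dist

def init_pg(backend: str, rank: int, world_size: int, eager_init: bool) -> None:
    # Passing a device_id makes init_process_group create the NCCL
    # communicator eagerly; None keeps today's lazy initialization.
    device_id = torch.device("cuda", rank) if (eager_init and backend == "nccl") else None
    dist.init_process_group(
        backend=backend,
        rank=rank,
        world_size=world_size,
        device_id=device_id,
    )
```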

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
2024-10-19 01:04:55 +00:00
b3ae1b1b73 [CMake] remove duplicated cmake options for Gloo and C10D (#138318)
Just a trivial fix :P
The CMake options on lines 345-357 are identical to those on lines 358-369; remove the duplicated lines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138318
Approved by: https://github.com/janeyx99
2024-10-19 00:26:25 +00:00
6647320de2 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. That saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, and we get 33% memory bandwidth savings.

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )
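The idea, reduced to a sketch (the constant and helper are hypothetical, not the actual scheduler code):

```python
MIN_MEMBW_SAVING_BYTES = 4096  # hypothetical threshold

def fusion_worth_it(bytes_saved: int) -> bool:
    # A tiny saving (e.g. a shared 4-byte scalar) should not by itself
    # justify a fusion; leave the nodes unfused so that loop ordering
    # after fusion can find a better candidate.
    return bytes_saved >= MIN_MEMBW_SAVING_BYTES

print(fusion_worth_it(4))        # False
print(fusion_worth_it(1 << 20))  # True
```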

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-19 00:22:43 +00:00
e8b1409dcf Revert "[user triton] typing triton_kernel_wrap.py (#138230)"
This reverts commit 2f61b69603756c1fcaef71b231e598df31e20f42.

Reverted https://github.com/pytorch/pytorch/pull/138230 on behalf of https://github.com/wdvr due to Reverting this, as it started failing tests on main ([comment](https://github.com/pytorch/pytorch/pull/138230#issuecomment-2423354596))
2024-10-18 23:12:29 +00:00
4632594546 [inductor] Move V.graph.scheduler.current_device to V.graph.current_device (#138252)
There are some places where it would be nice to use this, but the scheduler hasn't yet been created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138252
Approved by: https://github.com/eellison
ghstack dependencies: #138170
2024-10-18 23:05:54 +00:00
85a6a782e5 [inductor] Generalize WorkspaceArg for graph-level semaphores (#138170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138170
Approved by: https://github.com/Chillee
2024-10-18 23:05:54 +00:00
13bcb065f5 [compiled autograd] enable some reentrant tests (#137290)
Some seem to fail due to queue_callback usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137290
Approved by: https://github.com/yf225
2024-10-18 22:25:08 +00:00
47e4045566 Revert "[pt2] Log is_forward field to dynamo_compile scuba table (#138097)"
This reverts commit 4e9273c84edafdcfff57521dde6675b967181ba8.

Reverted https://github.com/pytorch/pytorch/pull/138097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a land race with https://github.com/pytorch/pytorch/pull/137803 ([comment](https://github.com/pytorch/pytorch/pull/138097#issuecomment-2423297516))
2024-10-18 22:00:40 +00:00
bd7cbddfe3 [CODEOWNERS] Remove aaronenyeshi from Profiler paths (#138346)
As title, remove aaronenyeshi from Profiler paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138346
Approved by: https://github.com/sraikund16
2024-10-18 21:46:00 +00:00
c88b77af9c [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 21:39:39 +00:00
7faa1284ab [ptd][amd] call alltoallv instead of send/recv (#136368)
Summary:
as $title

AMD provides an a2av API; we should just use it instead of implementing PTD's own set of send/recv.
We should not skip 0B send/recv within a2av, as it may lead to deadlock; see details in https://github.com/ROCm/rccl/pull/1349

Test Plan:
before:

mvai-job will timeout on all2all

https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240913-1426-327e119d?job_attempt=1&version=0&env=PRODUCTION

after:

https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240919-1932-ebce94e6?job_attempt=0&tab=execution_details&env=PRODUCTION

latest APS job: https://fburl.com/mlhub/vn6dj7zp

Differential Revision: D63076315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136368
Approved by: https://github.com/xw285cornell
2024-10-18 21:31:57 +00:00
5b58697cc7 [Profiler] Clang bugs in Collection [1/n] (#138296)
Summary: I have to keep bypassing issues because of these clang rules. Let's start with all of the bug fixes rather than the variable-name ones, because the latter will introduce a lot of lines of code and can make things hard to read.

Test Plan: Format tests pass.

Differential Revision: D64411171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138296
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
2024-10-18 21:06:50 +00:00
295de00908 [PT2 Compile Events] Revamp PT2 Compile/chromium event logging [1/?] (#138093)
This diff is the starting steps of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709

It implements the following changes:

- Only log spans to scuba, so no start events are ever logged
- Log events as the full event name, without "START" or "END"
- Only log to scuba major phases from chromium events. These are:
  - entire_frame_compile (dynamo)
  - backend_compile (aotdispatch)
  - inductor_compile (inductor)
  - codegen (inductor codegen)

Tlparse chromium events stay basically the same. But I implemented a few changes to clean that up as well:
- When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace to not have two identical rows. The fn_name is available as metadata on the chromium event, if interested
- Log new events for pre and post grad passes. These do *not* log to scuba.

By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add **much** more metadata and information to each individual event type. Diffs for that will come later.

**IMPLEMENTATION NOTES:**
- The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one.

- There's an interesting space after pre grad passes within AOT autograd logic, that's between create_aot_dispatcher_function and pre grad passes. I'm not sure what we're spending time doing in that time, but I'll find out with a profile later.

Differential Revision: [D64479033](https://our.internmc.facebook.com/intern/diff/D64479033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138093
Approved by: https://github.com/ezyang
2024-10-18 20:36:08 +00:00
3c7d9d6c7f [dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)
This method was no longer needed after #113725; the checking logic is
now in `SideEffects.check_allowed_side_effect`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #137905
2024-10-18 20:20:42 +00:00
162eba2dee [dynamo] Remove mutable_local.source and index on VariableTracker rather than MutableLocalBase (#137905)
This patch addresses parts of the side-effect refactor proposed in #133027;
specifically, it does 3 things:

1. Change `SideEffects.store_attr_mutations` and `PyCodegen.tempvars`
   to index on `VariableTracker` rather than `MutableLocalBase`.
2. Remove the `source` field from `MutableSideEffects` and
   `AttributeMutation`, and use `VariableTracker.source` instead.
3. Plumb a `overridden_sources: Dict[Source, Source]` from
   `handle_aliases_for_stolen_lists` to `PyCodegen` so that we don't
   update `VariableTracker.source` in place, while still preserving what
   `handle_aliases_for_stolen_lists` needed (i.e., modifying codegen for
   certain `VariableTracker`).

(1) and (2) are merged in 1 patch because of some dependency between
a. `OutputGraph.handle_aliases_for_stolen_lists` which iterates over
   `sideSideEffects.store_attr_mutations.keys()`, and potentially update
   its source field to be completely different.
b. `SideEffects.codegen_update_mutated`, which happens after the above
   and uses `cg(var.mutable_local.source)`.
where if we apply (1) only, (b) breaks, and if we apply (2) only, (a)
breaks.

(3) is needed for correctness, see comments in the PR for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137905
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-10-18 20:20:42 +00:00
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d7589fd7fa98dc247ae4b1b7ab239ca3c.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
cd1e9b0e60 [EZ] Remove canary scale config (#138361)
Removing just the LF canary scale config for now to test the changes in https://github.com/pytorch/test-infra/pull/5767

Those changes have been deployed to prod and appear to be working, but this will be the final proof that it is in fact reading the test-config version of scale-config and not the pytorch/pytorch copy.

Note: This will break the Scale config validation workflow on test-infra, but it's worth it since this test will be very short lived and that workflow only runs when someone modifies scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138361
Approved by: https://github.com/wdvr
2024-10-18 20:02:00 +00:00
1ac42b5f3e graph.py: Refine unspec variable finding (#137303)
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D.  This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.
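The added check amounts to something like this (a standalone sketch; the real check operates on dynamo's example values):

```python
import torch

def is_wrapped_scalar(example_value) -> bool:
    # Only treat a value as an unspec scalar if it is truly 0-D;
    # a 1-D tensor must keep being passed by pointer.
    return isinstance(example_value, torch.Tensor) and example_value.dim() == 0

print(is_wrapped_scalar(torch.tensor(3.0)))  # True
print(is_wrapped_scalar(torch.ones(1)))      # False
```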

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
2024-10-18 20:00:25 +00:00
d5bb70afe3 [Pipelining] Remove unnecessary {0,1} qualifier from regex (#138271)
There should always be 1 action.  This may be an artifact from trying to
extend the regex to handle the fused SEND_F_RECV_B style actions, which
was abandoned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138271
Approved by: https://github.com/H-Huang
ghstack dependencies: #138142
2024-10-18 19:52:07 +00:00
f23e8a8923 [Pipelining] Fix/improve format_pipeline_order (#138142)
Fix an issue where the format fn modified the original data structure; avoid this.

Change from printing "None" to empty string, for cleaner visualization
of bubbles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138142
Approved by: https://github.com/H-Huang
2024-10-18 19:52:07 +00:00
d512d0e227 Always use aten.constant_pad_nd for mm padding (#137820)
Summary: From experiments, it seems that aten.constant_pad_nd has better QPS than torch.cat. The QPS gain for ig ctr is ~10%, and ~5% for oc.
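For context, the two padding strategies being compared look roughly like this (shapes are hypothetical; `F.pad` lowers to aten.constant_pad_nd):

```python
import torch
import torch.nn.functional as F

x = torch.randn(128, 1001)          # K = 1001, not a multiple of 8
pad = 8 - x.shape[-1] % 8

# torch.cat-based padding: materializes a zeros tensor and concatenates.
x_cat = torch.cat([x, x.new_zeros(x.shape[0], pad)], dim=-1)

# constant_pad_nd-based padding (via F.pad): what the lowering now prefers.
x_pad = F.pad(x, (0, pad))

assert torch.equal(x_cat, x_pad)
print(x_pad.shape)                  # torch.Size([128, 1008])
```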

Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```

Differential Revision: D64271583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
2024-10-18 19:35:03 +00:00
2f61b69603 [user triton] typing triton_kernel_wrap.py (#138230)
Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fixed all the mypy errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-10-18 19:29:31 +00:00
1f32a1fb80 Replace torch.export default decomp table to be lazily populated (#137650)
In this PR, we implement lazy dictionary for export decomp behaviour for following reasons:
1. Custom op loading can happen after import time; as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible.

I intentionally separated out core_aten_decomp so that this PR does not include any custom CIA ops, to mitigate the risk of getting reverted. In the future, core_aten_decomp under torch/_decomp will exist as an alias to the official export table (torch.export.default_decompositions).
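The lazy-dictionary idea, in a simplified form (a sketch, not the actual torch.export implementation):

```python
class LazyDecompTable(dict):
    """Materializes its entries on first access, so decomps registered for
    custom ops after import time are still picked up."""

    def __init__(self, populate):
        super().__init__()
        self._populate = populate      # callable building the real table
        self._materialized = False

    def _materialize(self):
        if not self._materialized:
            self._materialized = True
            self.update(self._populate())

    def __getitem__(self, key):
        self._materialize()
        return super().__getitem__(key)

    def __contains__(self, key):
        self._materialize()
        return super().__contains__(key)

    def items(self):
        self._materialize()
        return super().items()
```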

Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
2024-10-18 19:28:52 +00:00
ea8ea2f33f Improve build_with_deb_info (#138290)
To skip over the commands that do not have an output file specified

Recently I've noticed that `generate_torch_version.py` started to run on every rebuild, and this results in a failed plan for deb info rebuilds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138290
Approved by: https://github.com/Skylion007
2024-10-18 18:50:12 +00:00
4e9273c84e [pt2] Log is_forward field to dynamo_compile scuba table (#138097)
Summary: ^^

Test Plan:
Ran a test script out of fbcode: D64350202. Then:

```
(pytorch-3.10_4) devvm2296:~/fbcode  $ scuba -e="select time,co_filename,is_forward from \`dynamo_compile/sandbox\` where is_forward is not null"
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|    time    |                                                                                    co_filename                                                                                    | is_forward |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
| 1729032583 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |          1 |
| 1729032583 | null                                                                                                                                                                              |          0 |
| 1729032650 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |          1 |
| 1729032650 | null                                                                                                                                                                              |          0 |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
4 row(s) in set (0 warnings, 131 errors, 0.80 sec)
```

Reviewed By: ezyang

Differential Revision: D64438144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138097
Approved by: https://github.com/ezyang
2024-10-18 18:48:52 +00:00
195d0a666b [BE][Ez]: Use interned hardcoded string FURB156 (#138330)
Uses string constants from string module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138330
Approved by: https://github.com/albanD
2024-10-18 18:26:16 +00:00
9c2a80322a Add Programmable Google Search (#137716)
- Adding the code for the programmable Google search
- Adding the CSS overrides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137716
Approved by: https://github.com/seemethere, https://github.com/albanD

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-10-18 18:18:16 +00:00
8d869c9ec7 Skip test_circular_dependencies on ROCm (#138312)
The test is flaky on ROCm and has been disabled for quite a while https://github.com/pytorch/pytorch/issues/110040.  The disabled issue was opened and then closed several times, so it's better to close that issue and skip the test here.

(This doesn't really fix the issue; I just want the test to be skipped on PRs instead of being disabled, and then close the issue)
Fixes https://github.com/pytorch/pytorch/issues/110040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138312
Approved by: https://github.com/jithunnair-amd, https://github.com/clee2000
2024-10-18 18:17:48 +00:00
620039c38c [inductor] Respect ir_dataclass(frozen=...) in Python 3.9 (#138247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138247
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2024-10-18 17:55:12 +00:00
ada7a8c217 Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)"
This reverts commit 8cb91109061648497ca09d6f1f9b9e13a2f5557e.

Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/yf225 due to because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2422961292))
2024-10-18 17:51:54 +00:00
59158f640c [dynamo] Support equality comparison between Tensor and None (#138289)
This patch updates the `wrap_fx_proxy_cls` function to allow boolean output when the operation is one of
`supported_const_comparison_op_values`.
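
A minimal usage sketch of the behavior this enables (assumed repro, not taken from the PR; `backend="eager"` is chosen only to keep the example self-contained):

```
import torch

@torch.compile(backend="eager")
def f(x, y):
    # With this change, `x != None` produces a plain Python bool under Dynamo
    # (as it does in eager) instead of failing to wrap the comparison result.
    if x != None:  # noqa: E711
        return x + 1
    return y

print(f(torch.ones(3), torch.zeros(3)))
```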

Fixes #120907.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138289
Approved by: https://github.com/williamwen42
2024-10-18 17:49:26 +00:00
9ea271d40b Expand doc for bundled autotune cache (#138298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138298
Approved by: https://github.com/ezyang, https://github.com/oulgen
2024-10-18 17:43:47 +00:00
4bba038b2f Add diagonal_copy to torch/_decomp/__init__.py (#136730)
Fixes https://github.com/pytorch/pytorch/issues/117349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136730
Approved by: https://github.com/masnesral
2024-10-18 17:39:17 +00:00
666572d819 Update viable strict workflow (#138262)
Corresponds to https://github.com/pytorch/test-infra/pull/5775

Tested in https://github.com/pytorch/pytorch/actions/runs/11393196544/job/31700963325?pr=138262 by adding my branch to the environment and pointing the workflow at my test-infra branch and commenting out the parts that did the push + upload record to s3

Versioning would have been good for this...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138262
Approved by: https://github.com/huydhn
2024-10-18 17:28:55 +00:00
912ea5601b Move manywheel binary scripts to pytorch (#138103)
PR to remove Manywheel Scripts:
https://github.com/pytorch/builder/pull/2017

Test PR : https://github.com/pytorch/pytorch/pull/138325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138103
Approved by: https://github.com/malfet
2024-10-18 17:11:28 +00:00
358ff3b731 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 1) (#136069)
[Inductor UT] Generalize newly introduced inductor UTs for Intel GPU
reuse `test/inductor/test_autoheuristic.py`
reuse `test/inductor/test_b2b_gemm.py`
reuse `test/inductor/test_custom_lowering.py`
reuse `test/inductor/test_efficient_conv_bn_eval.py`
reuse `test/inductor/test_group_batch_fusion.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136069
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-18 16:58:09 +00:00
8dd575faf6 [BE] Modernize C10_UNUSED (#138102)
[`[[maybe_unused]]`](https://en.cppreference.com/w/cpp/language/attributes/maybe_unused) is part of C++17 standard

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138102
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet, https://github.com/eqy
2024-10-18 16:33:01 +00:00
de51ed8610 [AOTI] Add C shim for _mkl_linear (#137880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137880
Approved by: https://github.com/desertfire
2024-10-18 16:26:19 +00:00
26ac5671dc Revert "Fix CompiledDDP failure when the gradient is not contiguous (#138174)"
This reverts commit 0ecafda6024f50734118dd794ac71b86c6e6d569.

Reverted https://github.com/pytorch/pytorch/pull/138174 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but I think it fails test_compute_comm_reordering in trunk for rocm and multigpu setup ([comment](https://github.com/pytorch/pytorch/pull/138174#issuecomment-2422818971))
2024-10-18 16:17:54 +00:00
98856f7ea1 Increase max runners available for linux.12xlarge and windows.8xlarge.nvidia.gpu.nonephemeral (#138332)
Related PR on test-infra: https://github.com/pytorch/test-infra/pull/5785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138332
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-10-18 16:17:36 +00:00
af306a392c Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit 7a117f3b3eea4cfeef21da2e3a8a1e39c30fa07d.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to unfortunately the failures on the previous import are still present on the current one D64568703 ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2422789143))
2024-10-18 16:01:01 +00:00
5a81475884 Documentation Update: Fix Missing Whitespace in Optimizer Docs (#138321)
### Description:

This PR addresses a minor [formatting issue identified in a previous contribution to the Optimizer documentation](https://github.com/pytorch/pytorch/pull/134107#discussion_r1800833948).

Specifically, it fixes the missing whitespace after `param_names` in the section on utilizing named parameters to load the optimizer state dict.

You can find the related docs here:
[Optimizer Documentation](https://pytorch.org/docs/main/optim.html#how-to-utilize-named-parameters-to-load-optimizer-state-dict).

@janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138321
Approved by: https://github.com/janeyx99
2024-10-18 15:41:43 +00:00
86aefa9405 typing subproc_pool.py (#138032)
Added type annotations to subproc_pool.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138032
Approved by: https://github.com/Skylion007
2024-10-18 15:31:05 +00:00
aa3ae50c07 Fixing MPS conv1d error message for output 2**16 (#134770)
Fixes #134416 by removing the misleading message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134770
Approved by: https://github.com/malfet
2024-10-18 14:13:20 +00:00
c4ed03cea1 Add proper handling for view and factory function for csan (#138236)
In particular, properly handle that some functions only read/write metadata on the Tensor and thus should not be detected as read/write by csan.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138236
Approved by: https://github.com/ngimel
2024-10-18 14:04:18 +00:00
0ff6f7a040 Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)"
This reverts commit 1581a93e8705dc23f649573d4404cd6816d614af.

Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))
2024-10-18 13:21:17 +00:00
e027403dea ILP for Auto SAC (Selective Activation Checkpointing) (#137908)
This PR presents a mixed integer linear programming (MILP) formulation that can be utilized to determine, under a memory budget, which modules to apply activation checkpointing (AC) to and how much activation memory should be discarded for each module. The MILP uses information collected from MemTracker, Runtime Estimator, and SAC Estimator, introduced in these PRs:
* https://github.com/pytorch/pytorch/pull/124688
* https://github.com/pytorch/pytorch/pull/134243
* https://github.com/pytorch/pytorch/pull/135208

End-to-end example and its sample output:

```
import copy
from typing import Tuple

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

from torch.distributed._tools.ilp_utils import (
    aggregate_stats,
    get_peak_memory_runtime_baseline,
    parse_module_info,
)
from torch.distributed._tools.mem_tracker import _ModState, MemTracker
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.distributed._tools.sac_ilp import sac_milp
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

def _init_model_input_optimizer() -> Tuple[
    torch.nn.Module, torch.optim.Optimizer, torch.Tensor
]:
    bsz = 8
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=8192,
        max_seq_len=1024,
        dim=768,
        dropout_p=0.1,
    )
    with torch.device(torch.cuda.current_device()):
        model = Transformer(model_args)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(
        0,
        model_args.vocab_size,
        (bsz, model_args.max_seq_len),
        device=torch.cuda.current_device(),
    )
    return (model, optimizer, inp)

def _run_and_get_mem_tracker(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> MemTracker:
    mem_tracker = MemTracker()
    mem_tracker.track_external(model, optimizer)
    with mem_tracker as mt:
        for iter_idx in range(2):  # running twice to initialize optimizer
            output = model(inp)
            output.sum().backward()
            if iter_idx == 1:
                last_snapshot = mt.get_tracker_snapshot("current")
            optimizer.step()
            optimizer.zero_grad()
            if iter_idx == 0:
                mt.reset_mod_stats()
    assert last_snapshot is not None
    for mod_stats in mem_tracker.memory_tracking.values():
        if _ModState.POST_BW not in mod_stats.snapshots.keys():
            mod_stats.snapshots.setdefault(_ModState.POST_BW, []).append(
                copy.deepcopy(last_snapshot)
            )
    return mem_tracker

def _run_and_get_runtime_estimator(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> RuntimeEstimator:
    def _run_one_step() -> None:
        output = model(inp)
        output.sum().backward()
        optimizer.step()
        optimizer.zero_grad()

    # Initializing optimizer states and warm-up
    _run_one_step()

    runtime_estimator = RuntimeEstimator()
    with runtime_estimator(estimate_mode_type="operator-level-cost-model"):
        _run_one_step()  # We use only one iteration for estimation
    return runtime_estimator

def _run_and_get_sac_estimator(
    model: torch.nn.Module,
    inp: torch.Tensor,
) -> SACEstimator:
    sac_estimator = SACEstimator()
    with sac_estimator(estimate_mode_type="operator-level-cost-model"):
        loss = model(inp).sum()
    loss.backward()
    return sac_estimator

def main():
    with FakeTensorMode():
        model, optimizer, inp = _init_model_input_optimizer()
        mem_tracker = _run_and_get_mem_tracker(model, optimizer, inp)
        runtime_estimator = _run_and_get_runtime_estimator(model, optimizer, inp)
        sac_estimator = _run_and_get_sac_estimator(model, inp)
        mod_info = aggregate_stats(
            model,
            mem_tracker,
            runtime_estimator,
            sac_estimator,
            torch.device(torch.cuda.current_device()),
        )
        g = parse_module_info(mod_info)

        peak_mem, compute_time = get_peak_memory_runtime_baseline(g)
        print("=== WITHOUT AC ===")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"compute_time: {round(compute_time, 2)} ms")

        ac_decisions, recomputation_time, peak_mem = sac_milp(g, memory_budget=1.75)
        print("=== WITH AC ===")
        print(f"ac_decisions: {ac_decisions}")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"recomputation_time: {recomputation_time} ms")

if __name__ == "__main__":
    main()
```

```
=== WITHOUT AC ===
peak_mem: 2.41 GiB
compute_time: 97.97 ms
=== WITH AC ===
ac_decisions: {'Transformer.layers.0': 0.5232, 'Transformer.layers.1': 0.5232, 'Transformer.layers.2': 0.6849, 'Transformer.layers.3': 0.5232}
peak_mem: 1.75 GiB
recomputation_time: 5.92 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137908
Approved by: https://github.com/weifengpy
2024-10-18 12:45:37 +00:00
7b863230ea [Docs] Optimize parameter description to declare allowed type (2/N) (#138152)
Inspired by issues #137422 and #103847

Optimize method parameter types in the docs to give users a clearer idea of what is expected to be passed to methods.

Previous PR:
- [x] https://github.com/pytorch/pytorch/pull/137956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138152
Approved by: https://github.com/albanD
2024-10-18 11:18:19 +00:00
354bc3ac11 [dynamo] Remove an unused variable in repro.after_aot (#138094)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138094
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-10-18 09:37:10 +00:00
e1c4548441 [dynamo] Simplify creation of VariableTrackers (#135714)
## `VariableTracker::build()` hides the Builders

### The problem

In the current code, creating a `VariableTracker` involves choosing one of two `Builder` classes and either calling a method, or calling a constructor that creates an object that you immediately call, [like this](083c9149b7/torch/_dynamo/variables/functions.py (L761-L768)).

Variations on this code are repeated in many places.

Moreover, the `Builder` classes have a lot of dependencies, so they have to be loaded late in the import process to avoid circular imports, and they end up being repeatedly imported at local scope.

### The solution

In this commit, the import from `builder` and the logic of choosing and calling the Builder class are hidden in a single static factory method, `VariableTracker.build()`, which is easier to reason about and to import.

This commit is a net reduction of over 150 lines of code, achieved by removing repetitive logic and unnecessary local imports.

**CHANGES:** Originally the name of the static method was `VariableTracker.create()` but a static method on a derived class, `LazyVariableTracker.create()` now exists with a different signature that's irreconcilable, so the new static method was renamed to `VariableTracker.build()`.
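
A rough before/after sketch of the pattern described above (illustrative only; the real signatures differ):

```
# Before: every call site picks a Builder class and does a local import.
def wrap_old(tx, value, source=None):
    from torch._dynamo.variables.builder import SourcelessBuilder, VariableBuilder

    if source is None:
        return SourcelessBuilder.create(tx, value)
    return VariableBuilder(tx, source)(value)

# After: one static factory hides the choice and the late import.
class VariableTracker:
    @staticmethod
    def build(tx, value, source=None):
        from torch._dynamo.variables.builder import SourcelessBuilder, VariableBuilder

        if source is None:
            return SourcelessBuilder.create(tx, value)
        return VariableBuilder(tx, source)(value)
```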

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135714
Approved by: https://github.com/jansel
2024-10-18 09:36:46 +00:00
1581a93e87 [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 09:10:01 +00:00
1a8b4c65ac Fix scatter and gather shape check error message (#138310)
The error message seems incorrect based on the surrounding code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138310
Approved by: https://github.com/Microve, https://github.com/fegin
2024-10-18 07:49:07 +00:00
517012058d Move test_db to training IR (#138251)
Differential Revision: [D64560792](https://our.internmc.facebook.com/intern/diff/D64560792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138251
Approved by: https://github.com/yushangdi
ghstack dependencies: #138249
2024-10-18 07:42:13 +00:00
29264fcbef Move test_verifier to training IR (#138249)
Differential Revision: [D64560351](https://our.internmc.facebook.com/intern/diff/D64560351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138249
Approved by: https://github.com/yushangdi
2024-10-18 07:36:29 +00:00
5d01126616 preserve module signature with multiple calls (#137999)
Previously we would error when trying to preserve the call signature for a module that was called multiple times. With this PR, this now works without erroring. The fix is to propagate call indices in a few more places.

Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.

Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)

Differential Revision: D64406945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
2024-10-18 07:30:22 +00:00
14e6624473 Update wmic command used in collect_env.py to its counterpart in powershell due to its deprecation (#138297)
As title.
`wmic` is deprecated in Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138297
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-18 07:03:17 +00:00
d116d007ee Add host-side Triton TMA support to Inductor (#137950)
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- Due to the Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950
Approved by: https://github.com/eellison
ghstack dependencies: #137677
2024-10-18 06:27:24 +00:00
82443798aa [Distributed] Refactor compress hook to remove duplicated code (#138182)
Fix TODO in code

```python
# TODO: create an internal helper function and extract the duplicate code in FP16_compress and BF16_compress.
```

1. Extract the common logic in `fp16_compress_hook` and `bf16_compress_hook` into a `_compress_hook` method
2. Let `fp16_compress_hook` and `bf16_compress_hook` invoke `_compress_hook` with a different `dtype`, as sketched below
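
A rough sketch of the resulting shape (assumed, not the exact upstream code):

```
import torch
import torch.distributed as dist

def _compress_hook(dtype, process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    # Cast the bucket to the lower-precision dtype, pre-divide, and allreduce.
    compressed = bucket.buffer().to(dtype).div_(world_size)
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        # Cast back to the original gradient dtype once the allreduce finishes.
        return fut.value()[0].to(bucket.buffer().dtype)

    return fut.then(decompress)

def fp16_compress_hook(process_group, bucket):
    return _compress_hook(torch.float16, process_group, bucket)

def bf16_compress_hook(process_group, bucket):
    return _compress_hook(torch.bfloat16, process_group, bucket)
```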

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138182
Approved by: https://github.com/awgu
2024-10-18 06:01:15 +00:00
80a58b7207 Use fresh cache directory in test_cudacodecache (#138243)
This test frequently times out flakily, for example, https://github.com/pytorch/pytorch/actions/runs/11377972115/job/31654107609#step:22:2376.  I still couldn't reproduce this behavior locally, running it multiple times and in parallel.  ~~So, I suspect that the error only shows up when other tests are run in parallel.~~

~~I attempt to run this serially in this PR, once land, I can monitor trunk to see if this helps.~~

Running serially still ends up with a timing out https://github.com/pytorch/pytorch/actions/runs/11391445912/job/31697603438, another try with fresh cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138243
Approved by: https://github.com/clee2000
2024-10-18 05:45:39 +00:00
0b168ceb6d Collect Nvidia libraries with collect_env.py (#138076)
Collect Nvidia libraries to diagnose issues like #133548.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138076
Approved by: https://github.com/malfet
2024-10-18 05:05:00 +00:00
8cb9110906 [CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-18 04:58:58 +00:00
a9014d2287 [BE][MPS] Compile without warnings on MacOS15 (#138238)
By guarding the calls to `-[MTLCompileOptions setFastMathEnabled]` with `C10_DIAGNOSTIC_PUSH` and `POP`
and using `-[MTLCompileOptions setMathMode:]` and `-[MTLCompileOptions setMathFloatingPointFunctions:]` on MacOS15
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138238
Approved by: https://github.com/atalman
2024-10-18 04:20:15 +00:00
cc6c248919 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 2) (#136856)
[Inductor UT] Generalize newly introduced inductor UTs for Intel GPU
reuse `test/inductor/test_inductor_freezing.py`
reuse `test/inductor/test_layout_optim.py`
reuse `test/inductor/test_loop_ordering.py`
reuse `test/inductor/test_memory_planning.py`
reuse `test/inductor/test_padding.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136856
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/jansel
2024-10-18 03:58:00 +00:00
c3cd9939fc aten | Deduplicate and silence set but unused variable warning. (#138270)
Summary:
Turns out we have two functions named slightly differently that do exactly the same thing.
Also silence the warning if the message is stripped out.

Test Plan: Sandcastle, no behavior change.

Differential Revision: D64566719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138270
Approved by: https://github.com/boguscoder, https://github.com/cyyever
2024-10-18 03:09:46 +00:00
73a153b931 [dynamo] add compiler.set_stance raw function call test and doc example (#138276)
Followup to https://github.com/pytorch/pytorch/pull/137504#issuecomment-2420107198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138276
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-18 02:54:22 +00:00
8b426d80dc [hops][refactor] Refactor the aliasing/mutation detection functions (#138234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138234
Approved by: https://github.com/ydwu4
ghstack dependencies: #138231
2024-10-18 02:35:00 +00:00
e714ebf664 [dynamo][testing] Update AOTEagerandRecordGraphs backend (#138231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138231
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/aakhundov
2024-10-18 02:35:00 +00:00
8a5dd7f59b Allow SequentialLR to include ChainedScheduler (#133450)
This fixes #132745 and allows a `SequentialLR` to include schedulers that are compound scheduler types (i.e., a `ChainedScheduler`), which contain a list of schedulers in a `_schedulers` attribute.
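
A usage sketch of the newly allowed composition (illustrative hyperparameters):

```
import torch
from torch.optim.lr_scheduler import (
    ChainedScheduler,
    ConstantLR,
    ExponentialLR,
    SequentialLR,
)

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

warmup = ConstantLR(opt, factor=0.5, total_iters=5)
# ChainedScheduler is a compound scheduler holding its members in `_schedulers`.
decay = ChainedScheduler([ConstantLR(opt, factor=1.0, total_iters=1),
                          ExponentialLR(opt, gamma=0.9)])

# Before this fix, SequentialLR errored on the compound member; now it is accepted.
sched = SequentialLR(opt, schedulers=[warmup, decay], milestones=[5])

for _ in range(10):
    opt.step()
    sched.step()
```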

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133450
Approved by: https://github.com/janeyx99
2024-10-18 02:29:38 +00:00
8cda774a03 Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773)
# Motivation
Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with.
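
Usage sketch (requires an XPU-enabled build; the printed values are illustrative):

```
import torch

if torch.xpu.is_available():
    print(torch.xpu.get_arch_list())      # AOT target architectures, e.g. ['pvc']
    print(torch.xpu.get_gencode_flags())  # the AOT flags PyTorch XPU was built with
```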

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-10-18 02:28:08 +00:00
6d8c9be54b [reland] Add int1 to int7 dtypes (#137928)
Summary:
Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases
for weight quantization

Test Plan:
python test/test_quantization.py -k test_uint4_int4_dtype

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D64344944](https://our.internmc.facebook.com/intern/diff/D64344944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137928
Approved by: https://github.com/malfet
2024-10-18 02:02:08 +00:00
7365a57dc0 [BC] Add check for core ATen opset schema BC (#137664)
Summary: Based on core ATen opset BC policy: https://dev-discuss.pytorch.org/t/core-aten-opset-backward-forward-compatibility-policy/1772

Enforcing this policy in `check_forward_backward_compatibility.py`.
Basically, the script will error out if any BC-breaking schema change
occurs to a core ATen operator.

Test Plan:

Run `python test/forward_backward_compatibility/dump_all_function_schemas.py --filename nightly_schemas.txt`

Manually added an argument to the `convolution` schema in `nightly_schemas.txt`;
see the following error:

```
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:329] Can NOT find backward compatible schemas after changes for schema aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor from the following candidates:
[
        aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups) -> Tensor
	aten::convolution.out(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, *, Tensor(a!) out) -> Tensor(a!)
]. Please contact PyTorch team to confirm if this BC breaking change is safe or not.
...
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:342] The PR is introducing backward incompatible changes to core ATen operators. Please contact PyTorch team to confirm whether this change is wanted or not.

Broken ops: [
	aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor
]
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137664
Approved by: https://github.com/albanD
2024-10-18 01:58:33 +00:00
21a9c06ca9 [c10d] differentiate timeout errors from nccl errors (#138240)
Summary:
Our watchdog does not clearly differentiate timeouts from NCCL errors, in either its logs or its code paths.
It's important for c10d to differentiate the different causes of watchdog
failures (e.g., timeouts vs. NCCL errors), and possibly to let users handle the
errors differently depending on the type of error.
Test Plan:
UT
Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240
Approved by: https://github.com/Skylion007
2024-10-18 01:36:32 +00:00
95f869c3d7 [pytorch_operator_stats] log if using torchscript runtime (#137986)
Summary: logs if an operator is run with the TorchScript runtime, using a thread_local variable set in `InterpreterState.run()`

Test Plan: buck2 run mode/dev-nosan caffe2/torch/fb/observers:scuba_observer_runner

Reviewed By: zou3519

Differential Revision: D64200781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137986
Approved by: https://github.com/angelayi
2024-10-18 00:55:22 +00:00
ad28565ed7 Use C++17 Convention Methods in PyTorch (#137958)
Detailed Descriptions:
- `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>`
- `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>`
- and so on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137958
Approved by: https://github.com/janeyx99
2024-10-18 00:52:51 +00:00
b7cf8fb800 c10 | Silence 'deprecated-dynamic-exception-spec' warning when importing cxxabi. (#138219)
Summary: cxxabi header specifically from llvm violates this, ignore the warning when including it.

Test Plan: No runtime behavior change, sandcastle only

Differential Revision: D64540217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138219
Approved by: https://github.com/boguscoder
2024-10-18 00:42:45 +00:00
2f91d7c63f [Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.

In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
2024-10-18 00:13:00 +00:00
6d473e0dda [autolint] move to use a label (#138263)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138263
Approved by: https://github.com/huydhn
2024-10-18 00:12:52 +00:00
a3172809a1 [EZ] Fix typo in Normalization.mm (#138283)
Introduced by 6b76a21ebd
One likely has to wait 125 years for the MacOS-150 release :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138283
Approved by: https://github.com/kit1980
2024-10-18 00:01:21 +00:00
b14c9b7250 [AMD] Hipify torchaudio_decoder (#138181)
Summary:
X-link: https://github.com/pytorch/audio/pull/3843

Continue to hipify more torchaudio targets.

Test Plan:
CI

  buck build mode/opt-amd-gpu pytorch/audio/src/...

Differential Revision: D64298970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138181
Approved by: https://github.com/houseroad
2024-10-17 23:37:37 +00:00
0ecafda602 Fix CompiledDDP failure when the gradient is not contiguous (#138174)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174
Approved by: https://github.com/yf225, https://github.com/kwen2501

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-17 23:08:24 +00:00
2fc6c32b4c Ensure version file is regenerated at change (#138237)
Fixes observed error where `version.py` would not be regenerated by CMake without deleting the file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138237
Approved by: https://github.com/Skylion007
2024-10-17 22:46:05 +00:00
770fcaf2ab Fix the Rank of logsumexp Tensor and mGPU support. (#137717)
The logsumexp tensor was considered for internal use only but was apparently exposed to unit tests and Inductor.

The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.

Fixes #131316 #137414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2024-10-17 21:58:14 +00:00
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
69ba89da11 Fix cuda sanitizer and as_subclass calls (#138218)
This fixes 4 main issues:
- The way the cuda sanitizer handles its state is weird. In particular, because the lifetime of the Mode is linked to the submodule, it might outlive the python runtime and other loaded modules. On my current version, this even outlives the "sys" module. Given that I'm not sure about the impact of changing this lifetime handling, I'm making the exit handler a no-op when python is already dying, since there is no point cleaning up at that stage.
- Adds a "disable" method to be able to test after the mode is enabled.
- Fix `Tensor.as_subclass()` to properly disable modes when creating the new Tensor object, just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here is just to apply the exact same treatment to it.
- ~Fix `Tensor.as_subclass()` not to propagate autograd as there is no valid backward associated here.~ We have tests that check that this behavior happens, so I guess this is expected behavior rather than an obvious bugfix. Reverted that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218
Approved by: https://github.com/ngimel
2024-10-17 21:18:32 +00:00
b14269dcfb Make Context to be Device-agnostic Step by Step (1/N) (#136519) (#138155)
Summary:
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Original pull request: https://github.com/pytorch/pytorch/pull/136519

Test Plan: contbuild & OSS CI, see 4a8e49389c

Reviewed By: malfet

Differential Revision: D64471142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet, https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
7a117f3b3e Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-17 19:24:54 +00:00
54839781ed Update lint failure msg to encourage lintrunner -a locally (#138232)
This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click the "approve all workflows" just for lint to fail again because the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a swirl once to start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-10-17 19:13:55 +00:00
dfb5ac05cc [Record Function] Add Kwargs only USER_SCOPE Macro (#138020)
Summary: Add a macro such that users can easily add a USER annotation with kwargs only

Test Plan: Will use D63801503 to test this E2E. Added unit test as well that makes sure that the kwargs get recorded correctly

Differential Revision: D64420328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138020
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi
2024-10-17 18:48:49 +00:00
0c76c68d7d [tlparse][AOTAutograd] Rename to aot_inference_graph in tlparse output (#137803)
Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which makes it hard to distinguish from the normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph", which makes it easier to tell which tlparse graph block is from Compiled Autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803
Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang
2024-10-17 18:44:37 +00:00
d531bd509e [Docs] Fix description in torch.save docs to show default for pickle_protocol instead of variable name (#138153)
Fixes #138013

Replace `DEFAULT_PROTOCOL` with actual default value `2` in `torch.save` method document

Before
![image](https://github.com/user-attachments/assets/cdd77d14-c009-4848-8538-9256bf22c32a)

After
![image](https://github.com/user-attachments/assets/f6b1063d-c955-478a-8d42-702b988426aa)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138153
Approved by: https://github.com/mikaylagawarecki
2024-10-17 18:13:05 +00:00
8abbd1c7c7 Modernize C10_NODISCARD to [[nodiscard]] (#138151)
PyTorch is C++17 now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138151
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-17 18:07:39 +00:00
6752e7dc3e Moved some of Inductor IR nodes to be frozen (#137859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137859
Approved by: https://github.com/ezyang
2024-10-17 18:04:45 +00:00
0b2c12cb4d Support more foreach ops for tensor beta support (#134170)
Add more foreach ops so we don't have fallbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134170
Approved by: https://github.com/eellison
2024-10-17 17:51:31 +00:00
92fdea8a39 remove skips due to https://github.com/pytorch/torchdynamo/issues/1991 (#138133)
Closes https://github.com/pytorch/pytorch/issues/93479. A bunch of other dynamo-wrapped tests also exhibit "torch.* returned non-Tensor output unimplemented" making the issue seem less relevant to me. Some tests are marked as xfail as they fail for other reasons.

If these tests are indeed important, we should create a new issue to track them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138133
Approved by: https://github.com/ezyang
2024-10-17 17:42:46 +00:00
6b76a21ebd [PyTorch] Fix incorrect macOS 15.0 gating in MPS backend (#138022)
The ifdef as written just checks if the macOS 15.0-capable SDK is being used. You also need a runtime gate to make sure macOS 15 is in use.

Differential Revision: [D64429453](https://our.internmc.facebook.com/intern/diff/D64429453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138022
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137722, #138014
2024-10-17 17:35:34 +00:00
d2a6c73235 Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)"
This reverts commit 20af56d4359c3f5fed2e8f94e111a8502f2ebeb3.

Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new tests are failing inductor distributed jobs ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2420109501))
2024-10-17 17:32:06 +00:00
2a50d77823 Move test_experimental.py to training IR (#138140)
Differential Revision: [D64510938](https://our.internmc.facebook.com/intern/diff/D64510938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138140
Approved by: https://github.com/avikchaudhuri
2024-10-17 17:30:10 +00:00
ecc5e05854 Refactor NJT min / max seqlen handling for convenience (#138130)
There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist without computing / caching if they don't. This PR introduces private convenience functions to simplify handling this and avoiding redundant checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130
Approved by: https://github.com/soulitzer
2024-10-17 17:28:39 +00:00
66478d0cf7 Revert "[compiled autograd] directly use python Logger class in cpp (#137953)"
This reverts commit af916613687d3bcc1d15362ba2fdf9312378c500.

Reverted https://github.com/pytorch/pytorch/pull/137953 on behalf of https://github.com/clee2000 due to breaking builds internally D64479234, I think it makes the build size of a package too large? The logs link to a wiki with instructions of what to do ([comment](https://github.com/pytorch/pytorch/pull/137953#issuecomment-2420086928))
2024-10-17 17:19:36 +00:00
3b0f3059f6 Revert "[Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)"
This reverts commit ebe37b23f11e150cd3afa5464193ee036e15277f.

Reverted https://github.com/pytorch/pytorch/pull/138113 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert https://github.com/pytorch/pytorch/pull/137953, please rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/138113#issuecomment-2420079703))
2024-10-17 17:16:44 +00:00
375dcb960f Revert "Avoid some dangling reference warnings (#132535)"
This reverts commit f3d7a02716d8725dcedff86094bd7e20f73155f1.

Reverted https://github.com/pytorch/pytorch/pull/132535 on behalf of https://github.com/clee2000 due to broke some internal builds D64479234 ([comment](https://github.com/pytorch/pytorch/pull/132535#issuecomment-2419983509))
2024-10-17 16:23:36 +00:00
348f208504 Autocast re-tracibility (#138082)
Summary:
Support autocast re-tracing by giving it the same treatment as set_grad.

In re-tracing, when dynamo encounters an autocast HOP, we want it to trace through `with torch.autocast()` again, and replace the HOP with the traced subgraph.
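
A rough repro sketch of the scenario (assumed, based on the `test_export_with_autocast` test name; module and shapes are made up):

```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            return torch.mm(x, x)

ep = torch.export.export(M(), (torch.randn(4, 4),))
# Re-tracing the exported module: dynamo should trace through the autocast
# region again and replace the autocast HOP with the traced subgraph.
ep2 = torch.export.export(ep.module(), (torch.randn(4, 4),))
```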

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_autocast
```

Differential Revision: D63856081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138082
Approved by: https://github.com/ydwu4
2024-10-17 16:09:11 +00:00
3087b5e431 [cond] support lifted symint inputs in subgraph (#137519)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137519
Approved by: https://github.com/eellison
2024-10-17 16:09:06 +00:00
2414c3f534 AOTI fixes for MI300 lowering (#137939)
Summary:
1) Add sleef back to enable SIMD on AMD
2) Add kpack to triton compute_meta for AMD triton, since there will be user-defined triton kernels using this for k-dim packing

Test Plan:
```
HIP_VISIBLE_DEVICES=0 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" buck run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark --  --skip-flop-estimation --skip-trt --skip-ait --enable-aot-inductor --sync-mode=0 --gpu-trace --sample-input-tile-factor=1  --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge" --lowering-input-str='{"serialized_inference_model_input_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge","serialized_inference_model_output_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/mi300_output.merge","submodule_names_to_lower":["merge"],"inductor_lowering_context":{"aot_inductor_lowering_settings":{"use_scripting":true,"preset_lowerer":"ifu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":3,"output_precision":3, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}},"model_entity_id":925729118,"model_snapshot_id":0,"add_sample_inputs":false,"hardware_type":0,"platform_arch":1,"dense_in_place_format":2}' --precision=bf16 2>&1 | tee local_benchmark_log.txt

```

Differential Revision: D64262924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137939
Approved by: https://github.com/frank-wei
2024-10-17 16:09:04 +00:00
502c6183e0 Prevent tuple instances from being weak-referenced. (#137838)
Summary:
Currently, https://fburl.com/code/uka25j1i checks whether the guarded object supports weakref by looking at its `__class__`
```
if hasattr(guarded_object.__class__, "__weakref__") and not isinstance(
    guarded_object, enum.Enum
):
    obj_ref = weakref.ref(guarded_object)
```

However, we have reason to modify this slightly because we use classes that "pretend" to be some other classes (e.g. nn.Parameter). Example https://fburl.com/code/8bcktgoh :
```
class QuantizedWeights:
    # TODO: Ugly trick so torch allows us to replace parameters
    # with our custom weights. Do this properly.
    @property
    def __class__(self) -> Type[nn.parameter.Parameter]:
        return nn.Parameter

    @property
    def grad_fn(self) -> None:
        return None
```

For example, Fp8RowwiseWeights, which inherits from the base class above and also from namedtuple, actually does not have a `__weakref__` attribute, but its "class" will say it does.

I think the easiest change is to use instance-level checking rather than class-level
```
if hasattr(guarded_object, "__weakref__") ...
```
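
A minimal illustration of the difference between the two checks, using a hypothetical class that mirrors the pattern above:

```
from collections import namedtuple
from typing import Type

import torch.nn as nn

_Fields = namedtuple("_Fields", ["data"])

class FakeParam(_Fields):
    __slots__ = ()  # namedtuple-style: no per-instance __dict__/__weakref__

    @property
    def __class__(self) -> Type[nn.Parameter]:  # pretends to be nn.Parameter
        return nn.Parameter

w = FakeParam(data=1.0)
print(hasattr(w.__class__, "__weakref__"))  # True  -> class-level check is fooled
print(hasattr(w, "__weakref__"))            # False -> instance-level check is accurate
# weakref.ref(w) would raise TypeError: cannot create weak reference to 'FakeParam'
```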

But I'm wondering if this will harm any of the existing behaviors.

I'd appreciate reviews from the experts

(I just added all recommended reviewers since I'm not sure who is the best person to consult...)

Test Plan: CI?

Reviewed By: YJYJLee

Differential Revision: D64140537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137838
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-17 16:08:32 +00:00
7e16c9d5f2 include bw_compiler in strobelight profile (#138060)
Summary: title + tlparse will have the phase name.

Test Plan: {F1933087525}

Differential Revision: D64450315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138060
Approved by: https://github.com/ezyang
2024-10-17 16:08:28 +00:00
20af56d435 [CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan
2024-10-17 10:51:07 +00:00
8cfe28e4e3 [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-10-17 09:06:57 +00:00
47077bfcb5 Remove an unused variable in _subclasses.fake_tensor (#138086)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138086
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-17 09:05:25 +00:00
ba10259115 Increase default COMPILE_STROBELIGHT_MAX_STACK_LENGTH to 500 (#138006)
Summary: pt2 call stacks are long; this reduces stack truncation
<img width="1363" alt="Screenshot 2024-10-15 at 11 35 11 AM" src="https://github.com/user-attachments/assets/d09a8fb5-eafc-4440-ab58-464889dc6df8">
<img width="1373" alt="Screenshot 2024-10-15 at 11 35 26 AM" src="https://github.com/user-attachments/assets/c4c9c245-54d1-4e35-b16f-029ece335e03">

Differential Revision: D64414746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138006
Approved by: https://github.com/bobrenjc93
2024-10-17 07:31:32 +00:00
5b7f4767ff Fix https://github.com/pytorch/pytorch/issues/138062 (#138137)
Fixes https://github.com/pytorch/pytorch/issues/138062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138137
Approved by: https://github.com/mlazos
2024-10-17 07:12:15 +00:00
f3c3f3a3c3 Fix assigning tensor with requires_grad as constant in export (#137997)
When we insert constants into the unlifted graph, we need to detach them if they require grad. But when we detach, we need to preserve the original aliasing information.

Differential Revision: [D64406859](https://our.internmc.facebook.com/intern/diff/D64406859/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137997
Approved by: https://github.com/avikchaudhuri
2024-10-17 06:41:10 +00:00
38d9924bfc Disable lint suggestions on my PRs (#138054)
The suggestions unusably clog up early draft PRs that are not necessarily lint clean yet. Making matters worse, even if I fix them I have to manually click through hundreds of comments to "Resolve" them even though I've fixed it. Disabling it on ghstack helps, but I occasionally do standard PRs via fbcode export mechanism. Opt me out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138054
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/PaliC
2024-10-17 05:28:37 +00:00
cyy
af8bd323e8 Remove legacy Caffe2 pthreadpool from CMake (#134936)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936
Approved by: https://github.com/ezyang
2024-10-17 05:22:08 +00:00
9c084cccfd [Pytorch][ATEN] Enable FP8 concatenate (#138046)
Summary: Float8 is becoming an increasingly popular datatype now that it is well supported on GPUs. This diff enables FP8 to work with `torch.cat`. This is pretty straightforward since memory operations don't vary based on the input dtype, but it can be quite helpful for FP8-based models.
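
A small usage sketch of what this enables (needs a CUDA build that supports float8; `float8_e4m3fn` is picked arbitrarily):

```
import torch

a = torch.randn(4, 8, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(4, 8, device="cuda").to(torch.float8_e4m3fn)
out = torch.cat([a, b], dim=0)  # previously unsupported for float8 inputs
print(out.shape, out.dtype)     # torch.Size([8, 8]) torch.float8_e4m3fn
```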

Test Plan:
```
buck2 run mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12 //caffe2/test:tensor_creation -- -r test_cat_all_dtypes_and_devices
```

Differential Revision: D64443965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138046
Approved by: https://github.com/eqy, https://github.com/qchip, https://github.com/jianyuh
2024-10-17 04:58:54 +00:00
ebd60f4074 update CMAKE_PREFIX_PATH setting command (#134934)
The current command for setting the `CMAKE_PREFIX_PATH` environment variable overwrites any values it was already set with. Changing it to append with `:` adds the conda env search path to the existing values, avoiding library-not-found issues.
`export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134934
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-17 04:19:18 +00:00
7db1f0b7b5 Minor assert error message improvement (#138053)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138053
Approved by: https://github.com/Skylion007
2024-10-17 03:54:15 +00:00
ebe37b23f1 [Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.

In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
ghstack dependencies: #138105
2024-10-17 03:45:10 +00:00
fe43f72be7 [AOTI] Remove the non-ABI-compatible mode (part 2) (#138047)
Summary: Continue to clean up non-ABI-compatible mode related code.

Differential Revision: [D64444327](https://our.internmc.facebook.com/intern/diff/D64444327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138047
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016, #138009
2024-10-17 02:54:24 +00:00
2e67d7cc35 [AOTI] Remove the non-ABI-compatible mode (part 1) (#138009)
Summary: The ABI-compatible mode has been turned on as default in https://github.com/pytorch/pytorch/pull/136534. Removing the non-ABI-compatible logic to greatly simplify the wrapper codegen logic.

Differential Revision: [D64439676](https://our.internmc.facebook.com/intern/diff/D64439676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138009
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016
2024-10-17 02:48:26 +00:00
7711f00553 [BE] Delete unused operator!= from the test (#138122)
If method is unused, why not delete it altogether?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138122
Approved by: https://github.com/swolchok
2024-10-17 02:24:48 +00:00
906fe05895 Naive impls for NJT matmul (#138121)
Our matmul support is abysmal - let's at least get this working and do it performantly later.

Bonus: implements `bmm` as well.

jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
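
A small usage sketch (assumed shapes; requires a build with this change):

```
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
w = torch.randn(8, 16)
out = torch.matmul(nt, w)  # jagged (B, j1, 8) @ (8, 16) -> jagged (B, j1, 16)
print(out.size(0), out.size(-1))  # 2 16
```
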
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
2024-10-17 01:31:46 +00:00
b4f7f4bf49 [Docs] Optimize parameter description to declare allowed type (1/N) (#137956)
Inspired by issues #137422 and #103847

Optimize method parameter types in the docs to give users a clearer idea of what is expected to be passed to methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137956
Approved by: https://github.com/albanD
2024-10-17 01:19:55 +00:00
c69f4518ec [SymmetricMemory] fix a race condition in _pipelined_produce_and_all2all that can cause correctness issues for very small chunk_producers (#138126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138126
Approved by: https://github.com/lessw2020
2024-10-17 01:05:41 +00:00
69e125a7e9 AOTInductor: fixup test (follow-up to #137401) (#137692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137692
Approved by: https://github.com/desertfire
2024-10-17 00:40:21 +00:00
94537e70b5 Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100
Approved by: https://github.com/Skylion007
2024-10-17 00:34:56 +00:00
504904c9c6 [Traceable FSDP2] Add compiled_autograd_enabled helper function (#138105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138105
Approved by: https://github.com/awgu, https://github.com/xmfan
2024-10-17 00:04:06 +00:00
0e9708f907 tensor constant with wrapped method (#138091)
Summary:
Tensor constants can show up through wrapped methods, so they may not always be found in constant attributes. They nevertheless need to be fakified, and their meta vals need to be found, to create graph signatures. Otherwise non-strict export barfs.

Longer term maybe we should pull this fakification up in non-strict.

Test Plan: added test

Differential Revision: D64480272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091
Approved by: https://github.com/tugsbayasgalan
2024-10-17 00:00:04 +00:00
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b409c226a1da0f597c66ece7c06de0e9e.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
5254a0d383 Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit cef6c3dcb07aafe25d62427e55442a46d7af3500.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to failing internal tests D64418200, some results not within tolerance? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2418122735))
2024-10-16 23:16:44 +00:00
ea2726452a add myself as codeowner in aot_autograd (#138075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138075
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #136670
2024-10-16 22:41:39 +00:00
a682194a11 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3)))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
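For concreteness, here is a rough sketch of the guard lookup described above. The attribute names (`shape_env.guards`, `guard.expr`) are assumptions based on this description, not a verbatim copy of the inductor change.

```python
import sympy

def size_is_one(size_expr, shape_env):
    # Constant sizes keep the existing fast path.
    if isinstance(size_expr, (int, sympy.Integer)):
        return size_expr == 1
    # Otherwise, check whether a previous guard pinned this expression to 1,
    # i.e. a guard of the form Eq(size_expr, 1).
    for guard in shape_env.guards:
        g = guard.expr
        if isinstance(g, sympy.Eq) and g.lhs == size_expr and g.rhs == 1:
            return True
    return False
```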

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-10-16 22:41:39 +00:00
56379e2c17 Remove an unused variable in _subclasses.fake_impls (#138085)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138085
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-10-16 22:41:04 +00:00
0bfa1bf21d [scan] support closure (#135602)
This PR adds an additional_inputs argument to support closures similar to what we've done for while_loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135602
Approved by: https://github.com/zou3519
ghstack dependencies: #135600, #135601
2024-10-16 22:28:03 +00:00
819d6b139c [scan] flatten subgraph output and make subgraph inputs to be a slice (#135601)
This PR introduces the following changes:
1. Before this PR, the subgraph's output is ([], []); in this PR, we change it to a flattened list for easier codegen and consistency with other control flow operators.

2. Before this PR, the combine_fn of scan takes a sliced input but keeps the sliced dimension. For example, suppose xs = torch.randn(3, 4, 5) and we scan over dim 0; the combine_fn looks like:
```
# x.shape = (1, 4, 5) instead of (4, 5)
def combine_fn(carry, x):
  ...
```

In this PR, we fix this and also simplify some of the slicing logic (see the sketch after this list).

3. This diff also makes sure we always stack ys on the first dimension.
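For point 2, a tiny sketch of what the combine_fn sees after this change (shapes follow the example above; the function body is made up):

```python
import torch

xs = torch.randn(3, 4, 5)  # scanned over dim 0

def combine_fn(carry, x):
    # After this change, x arrives with the scanned dim removed:
    # x.shape == (4, 5), not (1, 4, 5) as before.
    new_carry = carry + x.sum()
    y = x * 2  # per point 3, ys are stacked along the first dimension
    return new_carry, y
```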

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135601
Approved by: https://github.com/zou3519
ghstack dependencies: #135600
2024-10-16 22:28:03 +00:00
0437a22d43 [scan] fix typo in signature and remove wrapper (#135600)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135600
Approved by: https://github.com/zou3519
2024-10-16 22:27:59 +00:00
443472b1ca [AOTI] Remove explicit abi_compatible setting in tests (#138016)
Differential Revision: [D64439674](https://our.internmc.facebook.com/intern/diff/D64439674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138016
Approved by: https://github.com/malfet
ghstack dependencies: #137982
2024-10-16 21:35:46 +00:00
6bc57549f9 [AOTI] Remove non-ABI-compatible tests (#137982)
Summary: Remove non-ABI-compatible mode tests since ABI-compatible mode has been turned on by default. Also clean up tests that explicitly set ABI-compatible to True.

Differential Revision: [D64439673](https://our.internmc.facebook.com/intern/diff/D64439673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137982
Approved by: https://github.com/malfet
2024-10-16 21:35:46 +00:00
a040c4a260 Use std::move on stringstream to prevent unnecessary copy. (#138065)
- Takes advantage of C++20's improved handling of move semantics for std::basic_stringbuf.
- Reduces unnecessary copying and improves memory efficiency, especially for long formatted strings.

Benchmark(proof of concept): https://quick-bench.com/q/qohAu0ARH3vSDyKVsoKEfXOO6BI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138065
Approved by: https://github.com/Skylion007
2024-10-16 21:35:10 +00:00
b72ff35f22 [c10d][ez] Add more inline comments to CUDAEventCache code (#138079)
Address @kwen2501's feedback in https://github.com/pytorch/pytorch/pull/138048 by adding more inline comments to the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138079
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040, #138048, #138059
2024-10-16 20:43:28 +00:00
f2c96f5d87 Add AOTI test (#138043)
Summary:
Add back the test that was removed in D63916320.

It should work now as D64361273 added back the workspace change.

Test Plan: CI

Differential Revision: D64442054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138043
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-10-16 20:41:07 +00:00
f95ddf0b31 [c10d] record world size in log (#138044)
Summary:
Record the world size in the log and Scuba table.
This helps us quickly figure out if there are missing flight recorder files from ranks.

Test Plan: Ran locally and noted that size was logged to scuba

Differential Revision: D64442949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138044
Approved by: https://github.com/Skylion007
2024-10-16 20:14:02 +00:00
24ee4af86b Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit 2b7c7a20b9c0e8e7f2773ffc5c9f79c3cae2070b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))
2024-10-16 20:05:38 +00:00
a0a978ce23 [aoti config] add raise_error_on_ignored_optimization (#138035)
Summary: Unfortunately this means adding another config.

Test Plan: ci

Differential Revision: D64437699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138035
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-10-16 18:38:47 +00:00
f1c741dbe9 Fixes GuardOnDataDependentSymNode error in masked_fill (#137060)
Fixes [P1621441513](https://www.internalfb.com/phabricator/paste/view/P1621441513) ([ref to internal post](https://fb.workplace.com/groups/6829516587176185/posts/1051474609896021/?comment_id=1055262166183932&reply_comment_id=1056583932718422))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137060
Approved by: https://github.com/ezyang
2024-10-16 18:16:33 +00:00
f173623bb2 [td] try catch exception, do not run td if not results (#138087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138087
Approved by: https://github.com/wdvr
2024-10-16 18:04:25 +00:00
dabe2a3c3b [Torch] Support meta device in random.fork_rng (#137715)
Summary:
## Why
random.fork_rng doesn't support meta device:
```
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/aps_models/ads/tools/memory_estimator/estimation_dense.py", line 655, in estimate_dense_memory_size
[rank0]:     losses.sum().backward()
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 604, in backward
[rank0]:     return handle_torch_function(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/overrides.py", line 1718, in handle_torch_function
[rank0]:     result = mode.__torch_function__(public_api, types, args, kwargs)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/_device.py", line 106, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 613, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank0]:     frame.recompute_fn(*args)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1507, in recompute_fn
[rank0]:     with torch.random.fork_rng(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/runtime/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]:     return next(self.gen)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/random.py", line 153, in fork_rng
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: torch has no module of `meta`, you should register a module by `torch._register_device_module`.
```

This blocks us from running backward() on a model with checkpointing enabled in meta mode.

## What
This diff handles the case of meta device in random.fork_rng.
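A minimal sketch of the now-handled call, assuming the `device_type` keyword of `torch.random.fork_rng`; the checkpoint/backward machinery from the trace above is omitted.

```python
import torch

# Previously this raised "torch has no module of `meta`"; with this diff the
# meta device case is handled (there is no real RNG state to fork for meta tensors).
with torch.random.fork_rng(device_type="meta"):
    x = torch.randn(4, device="meta")
```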

Test Plan: Tested with toy model which has checkpoint on its module: P1641201046

Differential Revision: D64161410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137715
Approved by: https://github.com/kit1980
2024-10-16 18:00:39 +00:00
a47bb4a393 Fix autocast for non-strict export (#137495)
Summary:

Add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad nodes into a HOP, but we should still have the set_grad_enabled/autocast nodes.

Add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes didn't show up in the export graph when we use `strict=False`.

- In autocast's enter and exit functions, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`. If we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as strict mode, except strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806).
```
- torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function
- torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__
- torch.amp._enter_autocast()'s torch.overrides.handle_torch_function
- PreDispatchTorchFunctionMode.__torch_function__
```
- In `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes.
- To match the strict mode behavior, we let the input node to the `_exit_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode`.
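A hedged sketch of the resulting behavior: exporting a module that uses autocast with `strict=False` and inspecting the graph for the autocast nodes (the module and shapes are made up).

```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, x):
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            return x @ x

ep = export(M(), (torch.randn(4, 4),), strict=False)
# With this change, _enter_autocast / _exit_autocast nodes should appear in
# the graph for non-strict export as well.
print(ep.graph)
```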

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_autocast
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_set_grad
```

Differential Revision: D64016023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495
Approved by: https://github.com/bdhirsh
2024-10-16 17:39:00 +00:00
7ba706c74e update get start xpu (#137479)
1. Respect the comment from the community: downgrade "Beta" to "Prototype" for the first XPU release with wheels.
2. Add wheel installation of torchaudio & torchvision for nightly on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137479
Approved by: https://github.com/atalman, https://github.com/malfet
2024-10-16 17:36:29 +00:00
7e704c2073 [c10d] Add unit test for CUDAEventCache to ensure caching is working (#138059)
We created a simple test to validate that the cache is indeed working, including when the cache is used up. Reverting the fix in (https://github.com/pytorch/pytorch/pull/138040) made the test fail, as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138059
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040, #138048
2024-10-16 17:34:57 +00:00
dd32a32cb6 Revert "Expose option to disable CRC-32 computation during torch.save (#137735)"
This reverts commit 534fa96f2d9a4feb1dcdfaecb3d73990db60f819.

Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))
2024-10-16 17:03:06 +00:00
2b7c7a20b9 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 16:42:57 +00:00
0a6c40faba Fix constant returning (#137993)
When a constant is used twice in the exported graph (the second use is returned as an output), the constant-lifting pass doesn't account for the second one being the output. This PR fixes that.

Differential Revision: [D64406108](https://our.internmc.facebook.com/intern/diff/D64406108/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137993
Approved by: https://github.com/avikchaudhuri
2024-10-16 16:42:09 +00:00
189c95457d [PyTorch] Don't hardcode 4 * Vec::size() in vectorized_reduction (#138014)
This will break once we support 128-bit vectors, and there's no reason to do it.

Differential Revision: [D64421982](https://our.internmc.facebook.com/intern/diff/D64421982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138014
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #137722
2024-10-16 16:41:59 +00:00
a12c859b00 [PyTorch] Check defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) instead of defined(CPU_CAPABILITY_NEON) (#137722)
The CPU_CAPABILITY system is for rebuilding kernels multiple times with different vector ISA targets. CPU_CAPABILITY_NEON was not being used for that, just as an extra flag for inductor. As a result, CPU_CAPABILITY_NEON-gated code was unnecessarily unavailable outside inductor. Fixes #137704

Differential Revision: [D64197046](https://our.internmc.facebook.com/intern/diff/D64197046/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137722
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-16 16:41:59 +00:00
361f42bc42 Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)"
This reverts commit 9aba0b91c8df4a15654f9ccc02abca31bdd81650.

Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))
2024-10-16 16:38:29 +00:00
af27f7888b [dynamo] Remove an unused variable in AOTDispatchAutograd (#137989)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137989
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-16 16:37:19 +00:00
753ba5d30a Move basic dependencies install to requirements-ci (#138024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138024
Approved by: https://github.com/huydhn
ghstack dependencies: #137991, #137992, #138023
2024-10-16 16:21:33 +00:00
4c8718d8e7 [dynamo] add torch.compiler.set_stance (#137504)
Attempt # 2 at https://github.com/pytorch/pytorch/pull/132926 to implement https://github.com/pytorch/pytorch/issues/123771.

Implement a new `torch.compiler.set_stance` function that can force `torch.compile` regions to run eagerly.

See added tests for usage examples.
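A small usage sketch based on the description above; the stance names are assumptions about the API (see the PR's tests for the authoritative examples).

```python
import torch

@torch.compile
def f(x):
    return x * x + 1

f(torch.randn(8))                         # compiles as usual

torch.compiler.set_stance("force_eager")  # compiled regions now run eagerly
f(torch.randn(8))

torch.compiler.set_stance("default")      # restore normal compilation behavior
f(torch.randn(8))
```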

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137504
Approved by: https://github.com/yf225, https://github.com/jansel
2024-10-16 16:18:25 +00:00
960c3bff98 [c10d] Refactor CUDAEventCache Create to use deque rather than stack (#138048)
We used a LIFO stack to store the CudaEvents in the cache. Somehow we like a FIFO deque better, so aside from improving the readability of the code, we use a deque instead. As @wconstab pointed out, both methods are equally correct because the moment we put the event into the stack/deque, the event is already ready for reuse; this change is mostly a preference change, not trying to fix anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138048
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040
2024-10-16 14:44:39 +00:00
932ae131fb Remove an unused variable in _inductor/codegen/simd.py (#138000)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138000
Approved by: https://github.com/Skylion007
2024-10-16 13:54:21 +00:00
f3d7a02716 Avoid some dangling reference warnings (#132535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132535
Approved by: https://github.com/aaronenyeshi
2024-10-16 13:41:12 +00:00
0c63de9755 [dynamo] Remove an unused variable in AutogradFunctionApplyVariable (#137985)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137985
Approved by: https://github.com/zou3519
2024-10-16 13:08:45 +00:00
15722debfb Remove two unused variables in _functorch/partitioners.py (#137998)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137998
Approved by: https://github.com/Skylion007
2024-10-16 10:58:31 +00:00
9aba0b91c8 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreading doesn't work yet; this adds Python-side TLS only for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-16 09:28:32 +00:00
af91661368 [compiled autograd] directly use python Logger class in cpp (#137953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953
Approved by: https://github.com/jansel, https://github.com/yf225
2024-10-16 09:28:32 +00:00
7f88bf96f9 test_execution_trace.py: Use instantiate_device_type_tests to run GPU tests on HPU as well (#133975)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

**CHANGES**

- Add support for HPU devices within the payload function.
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances (see the sketch below).
- Expand the supported_activities() function to include checks for torch.profiler.ProfilerActivity.HPU.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
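A rough sketch of the instantiation pattern referenced in the second bullet; the test class and body here are made up, only the mechanism matters.

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class ProfilerDeviceTest(TestCase):
    def test_simple_op(self, device):
        # `device` is filled in per backend ("cuda", "hpu", ...) by the framework.
        x = torch.ones(4, device=device)
        self.assertEqual(x.sum().item(), 4.0)

# Generates ProfilerDeviceTestCUDA, ProfilerDeviceTestHPU, ... as available.
instantiate_device_type_tests(ProfilerDeviceTest, globals())

if __name__ == "__main__":
    run_tests()
```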

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133975
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
2024-10-16 07:53:06 +00:00
deaf0418b2 [2/N] Fix clang-tidy warnings in torch/csrc/api/ (#136998)
Follows #134545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136998
Approved by: https://github.com/ezyang
2024-10-16 07:50:59 +00:00
f4158558aa [c10d] disable watchdog thread in blockingWait mode (#138001)
Summary:
Blocking wait mode is not widely used; it is probably useful for debugging.
In blockingWait mode, we don't need to enable the watchdog thread to
check for timeouts or NCCL errors, because the main thread would throw an
exception if an error happens. It is then obvious to the user which work failed,
and it is the user's responsibility to handle the exception.
Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #137799
2024-10-16 07:42:22 +00:00
78632b97b1 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit f43c4d28b8f955fe1f2b80f193815edadc95507b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))
2024-10-16 07:26:34 +00:00
7480e6938d [inductor] Add LoopBody.op_counts (#137945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137945
Approved by: https://github.com/eellison
ghstack dependencies: #137946
2024-10-16 06:35:10 +00:00
0d7b2118ed [inductor] Refactor triton dtype helpers (#137946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137946
Approved by: https://github.com/eellison
2024-10-16 06:35:10 +00:00
97f7fc1d31 Support retry when building Docker images (#138012)
Similar to https://github.com/pytorch/test-infra/pull/5759, I'm seeing flaky network errors from time to time when building Docker images, for example https://github.com/pytorch/pytorch/actions/runs/11352439248/job/31575206417.

So, adding retries to mitigate this class of flaky failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138012
Approved by: https://github.com/atalman
2024-10-16 06:10:41 +00:00
084657e012 [c10d] Fix data corruption bug after CUDAEventCache is enabled (#138040)
Here is why using `CUDAEventCache` causes crashes and data corruption:
1. The deleter is doing its job and appends the event to the stack.
2. In create, instead of getting a reference, we are getting a copy of eventsArray_[i] (which is a std::vector). This is bad because we never really remove the element from the stack. While we thought we had already popped the last one from the stack, it turns out the last one is still in the stack; we end up reusing the same event again and again. What's worse, since we keep adding new events to the stack, this will eventually blow up the stack and a crash happens.

The fix is easy: just get a reference. A local torchtitan run sees a non-NaN loss.

We also want to use a deque instead of a stack and refactor the code a bit to make it more readable (in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138040
Approved by: https://github.com/kwen2501, https://github.com/shuqiangzhang
2024-10-16 05:20:29 +00:00
f43c4d28b8 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 05:03:08 +00:00
60b4858977 [BE][Docker] Don't update scikit-learn (#138023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138023
Approved by: https://github.com/huydhn
ghstack dependencies: #137991, #137992
2024-10-16 05:01:40 +00:00
7f6e85bb93 [BE] Move numpy installation logic to requirements-ci.txt (#137992)
And slightly adjust the versioning logic, as the current one seems to exist to hide version conflicts:
 - 1.21.2 for Python-3.9
 - 1.24.2 for Python-3.10 (to resolve conflict with numba-0.55.2)
 - 1.26.2 for Python-3.11 or 3.12
 - 2.1.2 for Python-3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137992
Approved by: https://github.com/Skylion007, https://github.com/huydhn
ghstack dependencies: #137991
2024-10-16 04:30:29 +00:00
12f4d91e84 Enable Python-3.13 builds on MacOS (#138037)
All logic changes happen in builder repo, namely:
 - a01e87535b
 - bcd0972459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138037
Approved by: https://github.com/huydhn
ghstack dependencies: #138041
2024-10-16 04:24:12 +00:00
66b39fd474 refactor KERNEL_MPS via reusing KERNEL (#137831)
# Motivation
Reuse `KERNEL` to simplify `KERNEL_MPS` for mps autocast code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137831
Approved by: https://github.com/malfet
2024-10-16 03:54:13 +00:00
2c94c54f10 Export XPU libs to be public (#136974)
# Motivation
Export XPU-related libs to be public. Now they are included in `TORCH_LIBRARIES`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136974
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-10-16 03:41:01 +00:00
80f3ee41dc [SymmetricMemory] fix incorrect numel calculations that are using int as std::accumulate's accumulator (#138038)
Fixes https://github.com/pytorch/pytorch/pull/137567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138038
Approved by: https://github.com/weifengpy
2024-10-16 03:34:26 +00:00
75109682b6 [Pipelining] Refactor Interleaved1F1B and ZeroBubble (#137783)
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`; let me know if there are any concerns.

`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR refactors the implementation into `Interleaved1F1B` and then removes `ScheduleFlexibleInterleaved1F1B`, since it is confusing to have both schedules with similar names. It also refactors the zero bubble logic to belong in the `ZeroBubble` schedule class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
2024-10-16 03:05:14 +00:00
809ff3b274 Add host-side Triton TMA support to Dynamo (#137677)
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.

Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
2024-10-16 02:18:48 +00:00
dd2ae7d0c9 [BE] Use x in [foo, bar] (#138041)
As shorthand for `x == foo or x == bar`
And `x not in [foo, bar]` as shorthand for `x != foo and x != bar`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138041
Approved by: https://github.com/huydhn
2024-10-16 01:57:37 +00:00
64ccebd2e0 update labeler for module: compiled autograd (#137954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137954
Approved by: https://github.com/yf225
2024-10-16 01:56:21 +00:00
aa28062169 [ROCm] TunableOp more unit test follow-up - Part 2 (#134517)
More unit tests to cover TunableOp functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134517
Approved by: https://github.com/jeffdaily
2024-10-16 01:49:47 +00:00
7fa7333299 [Distributed][Test] Fix todo in distributed test files (#136836)
Refactor distributed test code:
- Fix TODO: (rohan-varma): remove model
- Fix TODO: add comments for TestTraverse
- Migrate deprecated method call `load_state_dict` and `save_state_dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136836
Approved by: https://github.com/kwen2501
2024-10-16 01:15:12 +00:00
a1b22e369b [c10d] add an API to get the future result(success or failure) of a collective and customize error handling (#137799)
Summary:
This PR is trying to let users know which exact collective call from the Python thread is failing, and
customize their own error handling function, instead of the watchdog thread crashing everything.

This is potentially very useful in fault-tolerant training, in which we can have in-process restart.
E.g., when an NCCL error is detected, users can potentially abort comms, re-init comms and go back to the previous checkpointed step and try again, instead of crashing the whole job.

This is to allow users to check the status of each collective call,
using the ivalue::future libs in PT core. This also allows users to
attach their customized failure handling functions by:
work.get_future_result().then(error_handling_func)

Note that the above call is also non-blocking for the CPU thread.
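A hedged sketch of the usage named above; it assumes an initialized process group, and the exact shape of the future's value (e.g. a WorkResult status) is an assumption.

```python
import torch
import torch.distributed as dist

def on_done(fut):
    # Inspect success/failure of the collective and react (e.g. abort and
    # re-init comms) instead of letting the watchdog crash the whole job.
    print("collective finished with:", fut.value())

t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)
# Non-blocking for the CPU thread; the callback runs once the result is known.
work.get_future_result().then(on_done)
```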
Test Plan:
Added a new test, test_get_future_result, to verify that the WorkResult is
correctly propagated to the users.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-10-16 00:20:09 +00:00
8d9c9727c0 aten | Fix set but unused variables warning in release builds. (#138008)
Summary: Fixing a warning that happens only in release builds.

Test Plan: Sandcastle + dependent diffs

Reviewed By: boguscoder

Differential Revision: D64415854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138008
Approved by: https://github.com/boguscoder, https://github.com/Skylion007
2024-10-16 00:05:39 +00:00
46ec4ad021 Add code pointer to internal Meta implementation (#137984)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137984
Approved by: https://github.com/albanD
2024-10-15 23:35:22 +00:00
4557f6e339 Revert "[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)"
This reverts commit bf0b67059882933574f71a3b11b2f0127915ee5b.

Reverted https://github.com/pytorch/pytorch/pull/137669 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing test_public_bindings in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/137669#issuecomment-2415331274))
2024-10-15 23:22:58 +00:00
19665f4619 [fake_tensor][cache] Supports ops with tuple of output tensors (#137935)
This is needed for invoke_subgraph work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137935
Approved by: https://github.com/masnesral
2024-10-15 22:15:07 +00:00
5d5783a263 Improve the scheduling of _pipelined_multi_all_gather_and_consume (#137850)
```
Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.

Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers

Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.

Ideal scenario 0 - "mv" overlaps with the first shard_consumer:

stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:

stream 0:       [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer:

stream 0: [ shard_consumer ]               [ cp ][ shard_consumer ]
stream 1:                   [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 1 - "b" is scheduled after the first shard_consumer:

stream 0:       [ shard_consumer ]         [ cp ][ shard_consumer ]
stream 1: [ mv ]                  [b][ cp ][ shard_consumer ]

We haven't yet figured out a way to ensure "mv" and "b" are either
overlapped with or scheduled before the first shard_consumer. Thus, to
prevent suboptimal scenarios, we are giving up the chance to overlap "mv"
and "b" with the first shard_consumer for now.
```

This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends to not overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. The following is an example of pipelining PyTorch's cutlass-based, row-wise scaling fp8 kernel:

Before this PR:
<img width="298" alt="image" src="https://github.com/user-attachments/assets/81e0a7f4-18ee-47c6-b258-04fdaca7a6a2">

With this PR:
<img width="253" alt="image" src="https://github.com/user-attachments/assets/982de5a8-da1e-4a8f-b67e-c9c869b0a77f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137850
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805, #137836
2024-10-15 21:35:14 +00:00
2ae1a4caa1 Improve the scheduling of _pipelined_produce_and_all2all (#137836)
```
Parallelization strategy: every rank issues independent compute
-> barrier -> p2p copy sequences on two streams. In addition to
computation/communication overlapping, the strategy allows for
computation/computation overlapping, greatly reducing
quantization inefficiency.

Ideally, stream activities would look like this ("b" for
barriers, "cp" for p2p copies):

[rank 0]
stream 0:         [  chunk_producer  ][b][ cp ][  chunk_producer ][b][ cp ]
stream 1: [  chunk_producer  ][b][ cp ][  chunk_producer  ][b][ cp ]

[rank 1]
stream 0:         [  chunk_producer  ][b][ cp ][  chunk_producer ][b][ cp ]
stream 1: [  chunk_producer  ][b][ cp ][  chunk_producer  ][b][ cp ]

Note that the barriers synchronize streams with the same ID
across ranks. They don't synchronize streams on the same rank.

Since the work on both streams is independent, there's no
guarantee that the chunk_producer from stream 0 or stream 1 will
be scheduled first. If there is a scheduling mismatch across
ranks, the barrier forces all ranks to wait for the slowest.

When scheduling mismatches occur among ranks, the stream
activities might look like this (note that p2p copies from
different streams cannot overlap with each other):

[rank 0]
stream 0: [  chunk_producer  ][b        ][ cp ][  chunk_producer ][b       ][ cp ]
stream 1:         [  chunk_producer  ][b]      [ cp ][  chunk_producer  ][b]      [ cp ]

[rank 1]
stream 0:         [  chunk_producer  ][b]      [ cp ][  chunk_producer  ][b]      [ cp ]
stream 1: [  chunk_producer  ][b        ][ cp ][  chunk_producer  ][b      ][ cp ]

To prevent this, we need to ensure that the chunk_producer on
stream 1 gets scheduled first on every rank. Without access to
the underlying kernels, CUDA offers no API to control the
scheduling order of two independent, overlapping kernels. Our
solution is to issue a small sleep kernel in stream 0. The sleep
duration is insignificant, but having an extra task in stream 0
will almost guarantee that the chunk_producer on stream 1 gets
scheduled first. Once the first chunk_producer is scheduled in
the correct order, there's very little room for the scheduling
order of subsequent kernels to be inconsistent across ranks.
```

Currently, we perform stream synchronization to ensure scheduling order. The stream synchronization has no bearing on correctness, but prevents inconsistent scheduling orders across ranks.

Without the stream synchronization, ranks may have inconsistent scheduling order, and the barriers cause all ranks to wait for the slowest rank:
<img width="379" alt="image" src="https://github.com/user-attachments/assets/ffb97e76-7e19-4449-b121-83c32ec3e91d">

With stream synchronization, the inconsistent scheduling order issue is addressed, but we lose compute/compute overlapping (this is the state before this PR):
<img width="378" alt="image" src="https://github.com/user-attachments/assets/4cb76246-625f-4fc1-b49a-823ae46d3f23">

With this PR, we get both consistent scheduling order across ranks and compute/compute overlap:
<img width="327" alt="image" src="https://github.com/user-attachments/assets/51ab1bdc-4f60-46e0-b53c-6d208e2d4888">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137836
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805
2024-10-15 21:35:14 +00:00
ef541c1a65 [fused_all_gather_scaled_matmul] support rowwise scaling (#137805)
This PR adds support for `A_scale` to be a row-wise scale. The op can automatically detect whether the row-wise scale is sharded or replicated. When the row-wise scale is sharded, the op all-gathers the scale in a pipelined fashion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137805
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738
2024-10-15 21:35:14 +00:00
05edaeaded [fused_scaled_matmul_reduce_scatter] support rowwise scaling (#137738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137738
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137643
2024-10-15 21:35:14 +00:00
91bc9dc2c9 [SymmetricMemory] implement timeout for barrier(), put_signal() and wait_signal() (#137643)
Suggested by @lw for better safety/reliability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137643
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-10-15 21:35:14 +00:00
eaec72d1e6 Link directly to new Custom Ops Landing Page (#137933)
e.g., click on first link in https://docs-preview.pytorch.org/pytorch/pytorch/137933/library.html#testing-custom-ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137933
Approved by: https://github.com/zou3519
2024-10-15 21:18:21 +00:00
aef4317ec8 [c10d] socket: retry connection timeout failures (#138003)
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to IPv4 for the remainder of the connection timeout.

The connection timeout here is not the same as the c10d timeout, which appears to be higher. We could adjust the Linux timeout directly, but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.

Example failure:
```
 socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
 socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```

From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
```
ETIMEDOUT
              Timeout while attempting connection.  The server may be
              too busy to accept new connections.  Note that for IP
              sockets the timeout may be very long when syncookies are
              enabled on the server.
```

Test plan:

CI for backwards compatibility

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
2024-10-15 21:17:05 +00:00
bf0b670598 [Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Fixes https://github.com/pytorch/pytorch/issues/114369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-15 20:52:58 +00:00
28a521e29a [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow (size 4) in c10::IValue::IValue() (#137924)
Summary: Calling `pop()` on an empty stack.

Test Plan: CI

Differential Revision: D64332420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137924
Approved by: https://github.com/Skylion007
2024-10-15 20:42:47 +00:00
3ecec0c90c skip lintrunner install on Windows. (#137981)
`lintrunner` does not support Windows x64. Ref: https://pypi.org/project/lintrunner/#files

When we install Python dependencies via `pip install -r requirements.txt` on Windows x64, it fails on `lintrunner`.
<img width="887" alt="image" src="https://github.com/user-attachments/assets/e3815177-e893-41ae-96af-8b39d12f74a7">

Solution: skip installing `lintrunner` on Windows.
Reference doc: https://peps.python.org/pep-0508/#environment-markers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137981
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2024-10-15 20:37:26 +00:00
35fc24fbed [PGNCCL] Fix bugs in non-blocking mode (#137741)
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, nor does it block until that point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
2024-10-15 20:35:39 +00:00
370d66d7dd aten/buck | Appropriately convert clang => msvc compiler_flags. (#137944)
Summary:
fPIC is not available in clang on Windows - filter it out.
Also configure the flags appropriately for MSVC.

Reviewed By: rameshviswanathan

Differential Revision: D64365660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137944
Approved by: https://github.com/mwdavis84, https://github.com/ChristianK275, https://github.com/boguscoder
2024-10-15 20:21:01 +00:00
487873f7ca [Inductor]: Support updated Triton AttrsDescriptor (#137757)
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialized of the descriptor parameters, and a creation via a static method instead of the class constructor.

Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.

Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
2024-10-15 19:34:59 +00:00
534fa96f2d Expose option to disable CRC-32 computation during torch.save (#137735)
The option only works in open source, not internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735
Approved by: https://github.com/albanD
2024-10-15 19:30:02 +00:00
3cc8c8b944 [FSDP2] Add set_unshard_in_backward(bool) (#137922)
For some expert use cases, the user knows some parameters are not required for backward, so we can skip the unshard in backward. One example is the embedding weight.
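A minimal sketch under assumptions: `fully_shard` is applied to an embedding whose weight is not needed for backward, and the unshard is then skipped. The import path and the process group / device mesh setup are assumptions and are elided.

```python
import torch
from torch.distributed._composable.fsdp import fully_shard  # assumed path

emb = torch.nn.Embedding(1000, 64)
fully_shard(emb)  # assumes an initialized process group / device mesh
# The embedding weight is not needed to compute gradients in backward,
# so skip unsharding it there.
emb.set_unshard_in_backward(False)
```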

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137922
Approved by: https://github.com/weifengpy
2024-10-15 19:11:14 +00:00
60cf72e028 enable auto functionalize v2 by default (#136685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136685
Approved by: https://github.com/zou3519
ghstack dependencies: #137760
2024-10-15 19:04:42 +00:00
05b6200ccd Do not compute base in export mode (#137760)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137760
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-10-15 19:04:42 +00:00
f5e38f65c5 [FlexAttention] Support training bias for eager (#136910) (#137526)
This PR is Part 2 of the implementation started in https://github.com/pytorch/pytorch/pull/136910 and rolls in the updates from https://github.com/pytorch/pytorch/pull/137451. The original was reverted due to calls to `@torch.library` at `import torch` time, so a registration call was added at the first call to `ModIndex`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137526
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-10-15 18:55:22 +00:00
cd292908e5 Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c48fe8901114aa2b0a9c2d77f915a2ad8ab2098b.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/clee2000 due to breaking some internal tests, probably usages of string_view that need to be changed? ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2414775064))
2024-10-15 18:55:09 +00:00
e1e6417d4c Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similarly to `caffe2/perfkernels/hp_emblookup_codegen.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-15 18:52:44 +00:00
b09d6f3a7d [EZ][BE] Delete 3.8 specific checks (#137991)
As we no longer support 3.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137991
Approved by: https://github.com/Skylion007
2024-10-15 18:45:49 +00:00
524fe784ec BundledAutotuneCache (take 2) (#137902)
Summary:
Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Attempt 2 of #134959 (D60677499).

Various configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version
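For reference, a minimal way to flip the config knob listed above from Python (equivalent in spirit to setting the env var; the attribute name is taken from the config entry above):

```python
import torch._inductor.config as inductor_config

# Bundle individual autotune cache entries into a single remote cache entry.
inductor_config.bundled_autotune_remote_cache = True
```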

Test Plan:
unit tests

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- On a cold run we got 0 hits and 40 misses. On a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<<
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D64336043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902
Approved by: https://github.com/oulgen
2024-10-15 18:39:47 +00:00
bf77f52895 Fix memory leak on masked Tensor (#137890)
Note that this also reverts the change from https://github.com/pytorch/pytorch/pull/137815, which is not needed anymore!

Without this, you create an unbreakable reference cycle. It is unbreakable because part of the cycle is through the autograd graph, which we cannot traverse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137890
Approved by: https://github.com/atalman, https://github.com/huydhn, https://github.com/Skylion007
2024-10-15 18:37:55 +00:00
0b7ef196cd Use filelock to build extension_device backend one at a time (#137930)
Fixes https://github.com/pytorch/pytorch/issues/136125
Fixes https://github.com/pytorch/pytorch/issues/137026
Fixes https://github.com/pytorch/pytorch/issues/137027

The compilation fails during `setUpClass`, so disabling the test doesn't do anything. The theory I have for this flaky issue is that `test_open_device_registration` from both `TritonExtensionBackendTests` and `ExtensionBackendTests` is run in parallel and cleaned up while the other is still in flight, causing flaky failures.

Here is an example failure https://github.com/pytorch/pytorch/actions/runs/11331105492/job/31512603585#step:22:1710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137930
Approved by: https://github.com/malfet
2024-10-15 17:46:28 +00:00
60eb3fccfa Revert "[ONNX] Remove ExportTypes (#137789)"
This reverts commit 3e0b83ad1f0a998ef8a72c5e82d9250ab800cce5.

Reverted https://github.com/pytorch/pytorch/pull/137789 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))
2024-10-15 17:40:06 +00:00
2831af39c4 Revert "[ONNX] Remove deprecated export_to_pretty_string (#137790)"
This reverts commit d0628a7e3921639f62d6a6fec9f9f1871e087533.

Reverted https://github.com/pytorch/pytorch/pull/137790 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))
2024-10-15 17:40:06 +00:00
dac0b4e62b Revert "Add SVE implementation of embedding_lookup_idx (#133995)"
This reverts commit 770c134998d3422bc2fa3b90baa235ed0c409e62.

Reverted https://github.com/pytorch/pytorch/pull/133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck? ([comment](https://github.com/pytorch/pytorch/pull/133995#issuecomment-2414596554))
2024-10-15 17:23:50 +00:00
d4d687ffb2 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit 4a8e49389c33934234dc89616fd17a58e760e2e7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:16 +00:00
9af4e0d2aa Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit a6eb0205225fce7ba7a75d200566613b84aff4e9.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:15 +00:00
44653895cc override bool(), is_nonzero for real tensor tracing (#136788)
Fixes bool() and is_nonzero() calls for real tensor tracing, non-strict export

Differential Revision: D63482693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136788
Approved by: https://github.com/ezyang
2024-10-15 17:13:44 +00:00
bdbe0cfffa Fix test_binary_ufuncs.py for NumPy 2 (#137937)
Related to #107302

The following tests failed in test_binary_ufuncs.py when testing with NumPy 2.

```
FAILED [0.0050s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_float32 - AssertionError
FAILED [0.0048s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_float32 - AssertionError
FAILED [0.0028s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_shift_limits_cpu_uint8 - OverflowError: Python integer -100 out of bounds for uint8
```

This PR fixes them.

More details:
* `test_shift_limits` failed because `np.left_shift()` and `np.right_shift()` no longer support negative shift values in NumPy 2.
* `test_scalar_support` failed because NumPy 2 changed its dtype promo rules. We special-cased the incompatible cases by changing the expected dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137937
Approved by: https://github.com/albanD
2024-10-15 17:04:24 +00:00
e4d7676c1b [CPU] Expand torch.special.i1 to Half and BF16 (#137899)
To match behavior of `torch.special.i0`

Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849

Also, add explicit high-precision template specialization for  `calc_i0` and `calc_i1` for `BFloat16` and `Half`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899
Approved by: https://github.com/Skylion007
2024-10-15 17:00:58 +00:00
4abe38bc94 RMSprop docs: add missing input "epsilon" (#137854)
Adding a missing input argument to the docs for RMSprop, like in the docs for AdamW: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137854
Approved by: https://github.com/janeyx99
2024-10-15 16:40:42 +00:00
822aa588bc Fix torch_np/test_basic for NumPy 2 (#137814)
Related to #107302

`TestExport.test_exported_objects` in `test/torch_np/test_basic.py` is failing with NumPy 2.
The test is checking if all methods under `torch._numpy` exist in `numpy`.
However, some of them are removed in NumPy 2.

This PR fixes the issue by not checking the removed methods when running with NumPy 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137814
Approved by: https://github.com/albanD
2024-10-15 16:40:28 +00:00
120fbe9caa Update inductor benchmark time to avoid flakiness (#137900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137900
Approved by: https://github.com/laithsakka
2024-10-15 16:17:04 +00:00
966a1a971e [ROCm] Add AMDSMI support for UUID input (#129741)
Adds support for using UUIDs with AMDSMI utilities in PyTorch via CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129741
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2024-10-15 15:56:30 +00:00
17ed403644 [ROCm] Enable test_triton* in test_sparse_csr suite (#137712)
All test_triton* UTs are now passing on ROCm within test_sparse_csr suite. See logs here: https://ossci-raw-job-status.s3.amazonaws.com/log/31376189926

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137712
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-10-15 15:41:21 +00:00
5689e33cfe [Intel GPU] Fix Windows linkage issue due to invisible structured kernel symbols (#137794)
The Intel GPU ATen library (libtorch_xpu) uses `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control their visibility, and `TORCH_API` is in turn controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because that macro does more than select the `TORCH_API` semantics: with it enabled, `TORCH_API` means the symbols are `hidden`.

https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99

Therefore, we need to use `TORCH_XPU_API` to decorate the generated structured kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
2024-10-15 15:31:37 +00:00
3361908fc5 torch/ao/quantization/utils.py: Moving eps to targeted device to avoid device mismatch issue (#135204)
MOTIVATION

We recently verified some quantization tests on devices other than CPU (e.g., CUDA and Intel Gaudi devices, identified as 'hpu'). We noticed a device mismatch error, as eps is a tensor created on CPU while the other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the targeted _device_.

CHANGES

Move eps to _device_ of other tensors.
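
A minimal sketch of the fix, assuming the observer computes eps from the float32 finfo (variable names below are illustrative):

```python
import torch

def _compute_scale(min_val_neg, max_val_pos, quant_max):
    # Create eps on the same device as the other tensors instead of defaulting
    # to CPU, so that the scale/zero_point math stays on-device (cuda/hpu/...).
    eps = torch.tensor(torch.finfo(torch.float32).eps, device=min_val_neg.device)
    scale = (max_val_pos - min_val_neg) / float(quant_max)
    return torch.max(scale, eps)
```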
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-10-15 14:58:55 +00:00
cef6c3dcb0 Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics, because in eager the epilogue is done entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-15 14:54:56 +00:00
b7f798caa4 Use C10_UNUSED instead of (void)X (#137239)
Summary:
Auto-generated with
```
buck run //scripts/rbarnes/regex_multiline_replacer:regex_multiline_replacer -- --find '^(\s*for\s*\()(const.*\n)\s*\(void\)[A-Za-z]+;\s*//\s*Suppress.*\s*\n(.*)'  --replace '\1C10_UNUSED \2\3' `find caffe2/ -regex ".*\.\(cpp\|h\)"`
```

Differential Revision: D33432600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137239
Approved by: https://github.com/Skylion007
2024-10-15 14:32:59 +00:00
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
5141ade8e3 [AMD] Do not skip 0-byte send/recv (#137952)
Summary: With https://github.com/ROCm/rccl/pull/1376, we can remove this hack now, and we have verified that we no longer run into the hang.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-xdwang-900def406a?job_attempt=0&version=1&env=PRODUCTION

Differential Revision: D64370817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137952
Approved by: https://github.com/eqy
2024-10-15 09:35:03 +00:00
b7be4b1e48 [AMD] Turn on fast path for index_put (#136136)
Summary:
This slow path is bad because it has a sync point, which makes the CPU side really slow. I'm not very sure whether AMD actually needs this with the newer ROCm version.

{F1870213925}

Test Plan: CI

Differential Revision: D62731130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136136
Approved by: https://github.com/danzimm, https://github.com/jeffdaily, https://github.com/eqy
2024-10-15 08:39:17 +00:00
f42d1b6fa1 Fix Intel GPU test failure due to unsupport bool for unfold (#137873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137873
Approved by: https://github.com/etaf, https://github.com/desertfire
2024-10-15 07:58:51 +00:00
cyy
8c860aef0d [Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942)
Reland of #137328, which was reverted due to reverting a dependent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137942
Approved by: https://github.com/eqy
2024-10-15 07:47:24 +00:00
56cc22eb01 [CI][Distributed] Not to test distributed_test.py with UCC (#137932)
Some UCC tests became unstable recently, with or without the M60 to T4 upgrade.
See for example: #137855 (without upgrade), #137161 (with upgrade).
So I am extracting the disablement from #137161 here.

Failure signature:
```
RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:496] [Rank 0][ProcessGroupUCC-0][READY]failed to post triggered collective, error code -6: Unhandled error, system error code 0
```

Earlier discussed here:
https://github.com/pytorch/pytorch/pull/137161/files#r1797353294

Cc: @Aidyn-A @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137932
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/eqy
2024-10-15 07:22:57 +00:00
5b442e8e92 Time torch_key computation in overall Dynamo stats (#137877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137877
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-10-15 05:47:19 +00:00
5c3ba6faff Add fbscribelogger to Dynamo benchmark runner (#137867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867
Approved by: https://github.com/bobrenjc93
2024-10-15 04:36:41 +00:00
ed94725b8c log ViewAndMutationMeta to trace_structured (#133784)
I ended up bundling it into the existing tlparse logs for the AOT forward graph, since it looked like registering it as a separate artifact requires changes to tlparse itself (maybe that is wrong though?)

Example new fw AOT graph tlparse output for the below code: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp70zKiO/0_0_0/aot_forward_graph_2.txt

```
import torch

@torch.compile
def f(x):
    out1 = torch.view_as_complex(x)
    out2 = torch.view_as_complex(x)
    return out1, out2, x * 2

x_ = torch.randn(4, 2, requires_grad=True, dtype=torch.float64)
out = f(x_)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133784
Approved by: https://github.com/ezyang
2024-10-15 02:49:02 +00:00
cyy
70206499f1 [3/N] Fix extra warnings brought by clang-tidy-17 (#137552)
Follows #137459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137552
Approved by: https://github.com/ezyang
2024-10-15 02:33:44 +00:00
a6eb020522 Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-10-15 01:53:28 +00:00
b34db401f2 Add support for div in tensorify_python_scalars fx pass (#137623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137623
Approved by: https://github.com/ezyang
2024-10-15 01:49:46 +00:00
8316f9b2a0 Fix autograd function calls without context arg (#137809)
Fixes an issue where if the context arg is not provided, Dynamo would throw an arg mismatch error.

The skips are there because Dynamo would previously fall back to eager on those tests due to the arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137809
Approved by: https://github.com/drisspg
2024-10-15 01:25:47 +00:00
a89cf2b59a [dynamo] Don't codegen temporary cells for pre-existing cells (#137907)
This patch removes tempvar codegen for the `NewCellVariable` that has
`AttributeMutationExisting`, because these tempvar will never get used.
Note that tempvar codegen for other objects also follow this pattern,
i.e., it only fires on `AttributeMutationNew`.

To visualize, in the following program, we'll see the modified bytecode
contains redundant `make_cell` calls, and stores the result to a local
`tmp_2` which is never used again.

```python
import torch

def test_write_cell():
    count = torch.ones(1)
    def inc():
        nonlocal count
        count = count + 1

    torch.compile()
    def fn():
        inc()

    fn()

test_write_cell()
```

```
$ TORCH_LOGS="bytecode" TORCH_LOGS_FORMAT="short" python test.py

......
    0 COPY_FREE_VARS           1
    2 RESUME                   0
    4 LOAD_GLOBAL              9 (NULL + __compiled_fn_2)
   14 LOAD_DEREF               3 (inc)
   16 LOAD_ATTR                6 (__closure__)
   36 LOAD_CONST               1 (0)
   38 BINARY_SUBSCR
   42 LOAD_ATTR                4 (cell_contents)
   62 CALL                     1
   70 STORE_FAST               0 (graph_out_0)
   72 LOAD_GLOBAL              0 (__import_torch_dot__dynamo_dot_utils)
   82 LOAD_ATTR                3 (NULL|self + make_cell)
  102 CALL                     0
  110 STORE_FAST               2 (tmp_2)
  112 LOAD_FAST                0 (graph_out_0)
  114 LOAD_CONST               1 (0)
  116 BINARY_SUBSCR
  120 LOAD_DEREF               3 (inc)
  122 LOAD_ATTR                6 (__closure__)
  142 LOAD_CONST               1 (0)
  144 BINARY_SUBSCR
  148 STORE_ATTR               2 (cell_contents)
  158 DELETE_FAST              0 (graph_out_0)
  160 RETURN_CONST             0 (None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137907
Approved by: https://github.com/anijain2305
2024-10-15 00:49:45 +00:00
1cf78bbf62 Refactored debug_extra to be on ChoiceCaller (and called description) (#137857)
Before:
<img width="644" alt="image" src="https://github.com/user-attachments/assets/17b0fa8a-37c8-494b-8914-9d42c3db4bef">

After:
<img width="1292" alt="image" src="https://github.com/user-attachments/assets/5ee59747-a34f-4dd6-b943-cb5a53d52080">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137857
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/masnesral
ghstack dependencies: #137768
2024-10-15 00:48:14 +00:00
3630398509 Move symbolic_shapes create_env back to INFO (#137926)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137926
Approved by: https://github.com/Skylion007
2024-10-15 00:37:01 +00:00
406db6a73d Improve ASAN path detection (#137865)
Follows #137335, for better adoption of the latest clang in ASAN jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137865
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-14 23:54:46 +00:00
aef3591998 [Profiler] Add Test for Clear on Fork (#137511)
Summary: Tests the Clear On Fork fix by forking a process after a profile has already been run. Afterwards we check that all of the PIDs/TIDs are as expected.

Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'

Differential Revision: D63992036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-10-14 23:20:33 +00:00
0786b37260 [MPS] Add i0 op (#137849)
More-or-less verbatim copy of 47c8aa8090/aten/src/ATen/native/Math.h (L101)
Plus a bit of MPS boilerplate code

Update test_mps.py to mark kaiser_window and i0 as passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137849
Approved by: https://github.com/Skylion007
2024-10-14 22:50:01 +00:00
18587f2427 [BE] Use std::enable_if_t in Math.h (#137920)
PyTorch is a C++17 project, so let's use some C++17 convenience methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137920
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-14 22:20:09 +00:00
8ac06467d4 Forward fix test (#137910)
Summary: Add back a deleted file to fix the test.

It was removed in https://github.com/pytorch/pytorch/pull/137404

Test Plan: `buck2 build --flagfile fbcode//mode/opt fbcode//caffe2/test/cpp/c10d:ProcessGroupGlooAsyncTest` succeeded

Differential Revision: D64341074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137910
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/kit1980
2024-10-14 22:07:29 +00:00
ad134fe038 Skip doc test internally (#137813)
Summary:
there are some path issues when we run the doc tests internally

https://www.internalfb.com/intern/test/281475143872621

Test Plan: sandcastle

Reviewed By: drisspg, msaroufim

Differential Revision: D64255824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137813
Approved by: https://github.com/HDCharles
2024-10-14 21:29:15 +00:00
7911bf591d [CUDA][Inductor] Fix some bfloat16 tests for SM70 (#137675)
Unsure about the `runtime_checks` changes as that's a pure pattern-match and guess

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137675
Approved by: https://github.com/eellison, https://github.com/jansel
2024-10-14 20:42:48 +00:00
6016b8a9be Remove CI/CD python 3.8 requirements (#137893)
Python 3.8 is deprecated from CI/CD. There is no reason to keep these pins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137893
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD, https://github.com/kit1980
2024-10-14 20:28:48 +00:00
3b7710316c Revert "cublaslt autotuning support for TunableOp (#133896)"
This reverts commit 19bbbef79da8ed32f72d6e76517cb639d5db6c00.

Reverted https://github.com/pytorch/pytorch/pull/133896 on behalf of https://github.com/clee2000 due to this is breaking internal builds, I've copied what I think is the most relevant part of the log below. I believe the job running internally uses an old version of cuda, could you put guards to make sure compilation still works on an older version of cuda/cublaslt? ([comment](https://github.com/pytorch/pytorch/pull/133896#issuecomment-2412180893))
2024-10-14 20:28:09 +00:00
df0c2f5cae Revert "[Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)"
This reverts commit 25ac5652d003c5526f496bd1e2cdfbe697c58ba4.

Reverted https://github.com/pytorch/pytorch/pull/137328 on behalf of https://github.com/clee2000 due to need to revert this in order to revert #133896, please rebase and reland, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/137328#issuecomment-2412143739))
2024-10-14 20:22:26 +00:00
674d59359d [ROCm] Enable dist sharded_tensor test suites (#137724)
Following test suites are enabled on ROCm
test_sharded_tensor
test_sharded_tensor_reshard
test_sharding_plan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137724
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-10-14 20:20:57 +00:00
39d21ed803 [Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458)
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks when removing the legacy attribute descriptor attributes.

Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).

With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
2024-10-14 20:20:29 +00:00
11e4232b42 Revert "[Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872)" (#137891)
This reverts commit e688b78791d01bd91614a61e57726c32beb46ee4.

We're reverting this because:
1) The original PR (#134872) fixed a bug but caused another one. The
   assessment is that the bug it caused is worse than the bug it fixed.
2) it was reverted on the release 2.5 branch, so we want to prevent
   divergence
3) The original author is out-of-office for a while so we don't want the
   divergence to wait until they're back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137891
Approved by: https://github.com/Skylion007
2024-10-14 20:12:58 +00:00
41c4aa9f7a [pipelining] rename prev_/next_stage vars to clarify (#137739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137739
Approved by: https://github.com/H-Huang
2024-10-14 20:12:18 +00:00
78299d75b7 [ScaledMM] More Large shape tuning (#137832)
Fixes a bug in the previous PR's check. Also, after some more performance tuning at very large sizes, we found that when N > M it is valuable to transpose; otherwise performance is better untransposed:

If you look at the absolute Tflops I think we still have some room for improvement!
### Perf

Here are some TFLOP deltas at larger sizes where green is the positive gain in TFLops at different values of K

![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_delta_heatmap](https://github.com/user-attachments/assets/dcd009a5-1e4f-449c-b852-a92bb7db66e3)

<details>
<summary>### Different Values of K</summary>
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K24576_tflops_delta_heatmap](https://github.com/user-attachments/assets/8c043f6c-b8aa-48a9-bd5d-3ec6f39818cd)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K16384_tflops_delta_heatmap](https://github.com/user-attachments/assets/41a4b9f4-2749-4a84-b9c7-bddc2c2334c0)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K12288_tflops_delta_heatmap](https://github.com/user-attachments/assets/68d42421-cfa9-4a0a-a5a5-9f6db80bf609)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K8192_tflops_delta_heatmap](https://github.com/user-attachments/assets/c03906a0-5de7-463e-96a8-85f1774b3af6)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K6144_tflops_delta_heatmap](https://github.com/user-attachments/assets/d697b2d0-efc9-4ea8-9002-d517f3abaf50)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K4096_tflops_delta_heatmap](https://github.com/user-attachments/assets/06f8ef5c-277f-45ca-a44f-ed2e54d4133a)
</details>

<details>
<summary>### Absolute Tflops</summary>

## Old
![large_shape_old_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/8872506b-0ff1-400e-8d11-71eff6d8d59a)

## New
![update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/9fc9ec24-ff1a-4b47-8934-72d181677d14)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137832
Approved by: https://github.com/vkuzo
2024-10-14 20:02:52 +00:00
d64492e4cb Increase verbosity of inductor cache hit/miss to INFO level (#137876)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137876
Approved by: https://github.com/Skylion007
2024-10-14 19:59:31 +00:00
eqy
914c90dcea [NCCL][CUDA] Set PYTORCH_C10_DRIVER_API_SUPPORTED in ProcessGroupNCCL.cpp compilation (#137828)
Otherwise `expandable_segments()` is hardcoded to false in `CUDAAllocatorConfig.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137828
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-10-14 19:38:23 +00:00
19918a1863 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
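
A rough illustration of the shape problem and the workaround, assuming the public jagged-layout NJT constructor (this is a sketch, not the actual engine code):

```python
import torch

a, b = torch.randn(2, 3), torch.randn(4, 3)
njt = torch.nested.nested_tensor([a, b], layout=torch.jagged)

# Allocating from the saved tensor preserves the jagged structure.
grad = torch.zeros_like(njt)

# Constructing zeros from the recorded size would fail instead, because the
# size contains a symbolic nested int rather than concrete integers:
# torch.zeros(njt.shape)  # raises
```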
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer, https://github.com/huydhn
2024-10-14 19:31:50 +00:00
197601eeea Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107)
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**

(also mentioned here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552))

## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict =
{
    'state': {
    0: {'momentum_buffer': tensor(...), ...},
    1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
        'lr': 0.01,
        'weight_decay': 0,
        ...
        'params': [0,1],
        'param_names': ['layer.weight', 'layer.bias']  (optional)
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.

## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.

## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.

#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.

## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.

## Solution Example:

A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict:
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping --  param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]

    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]

    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initialized with `named_parameters()` to have the 'param_names' key in the dict.
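
For completeness, a minimal usage sketch of wiring the hook above into an optimizer before loading (assuming `model` and `loaded_state_dict` from the surrounding example, and that both optimizers were created from `named_parameters()`):

```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
# The pre-hook receives (optimizer, state_dict) and may return a modified dict.
optimizer.register_load_state_dict_pre_hook(adapt_state_dict_ids)
optimizer.load_state_dict(loaded_state_dict)  # ids remapped via 'param_names'
```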

### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-10-14 19:24:44 +00:00
4470339fbb [dynamo] Fix an error in _dynamo.compiled_autograd.reset() (#137889)
----

* From https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137889
Approved by: https://github.com/Skylion007
2024-10-14 18:21:18 +00:00
929797dedb Fix test_matmul_offline_tunableop by writing its output files to a temp dir (#137835)
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:

```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
    os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```

For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097

The test tried to catch and ignore this, but this is Windows.  So, the fix is to:

1. Ignore if these files couldn't be removed
2. Write them to a temp directory instead, otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy
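
A minimal sketch of (1) and (2) combined, with illustrative file names:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "tunableop_untuned0.csv")
    # ... run the offline TunableOp tuning, writing its CSV output to `filename` ...
    try:
        os.remove(filename)
    except OSError:
        pass  # on Windows the file may still be held open by another process
```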

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
2024-10-14 17:28:33 +00:00
f8a5b7170a Revert "Fix autograd.Function + NJT when an output grad is None (#136875)"
This reverts commit 76ab1ab66560213701943ecde368aedcd5de08e5.

Reverted https://github.com/pytorch/pytorch/pull/136875 on behalf of https://github.com/jbschlosser due to Caused memory leak ([comment](https://github.com/pytorch/pytorch/pull/136875#issuecomment-2411665776))
2024-10-14 16:00:44 +00:00
47bb494e49 Add support for sub in tensorify_python_scalars fx pass (#137622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622
Approved by: https://github.com/ezyang
ghstack dependencies: #137620
2024-10-14 15:37:29 +00:00
f246507f28 Add support for add in tensorify_python_scalars fx pass (#137620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620
Approved by: https://github.com/ezyang
2024-10-14 15:10:27 +00:00
a77145ae2f Selective Activation Checkpointing (SAC) Estimator for estimating memory and recomputation time trade-offs. (#135208)
This PR adds a Selective Activation Checkpointing (SAC) Estimator, built on top of the `Runtime Estimator`, for estimating memory and recomputation time trade-offs.
It provides a `TorchDispatchMode`-based context manager that estimates the memory and runtime trade-offs of functions or `torch.nn.Module`s for SAC, using the `Runtime Estimator` #134243 under the hood to support two estimation modes: 'operator-level-benchmark' and 'operator-level-cost-model' (roofline model). The SAC Estimator provides detailed statistics and metadata for the operators of each module, including a greedy order for selecting operators to be recomputed/checkpointed and per-module trade-off graphs. This estimator is designed to be used under FakeTensorMode and currently supports estimation of compute time and memory usage.

It's inspired by [XFormers SAC](https://github.com/facebookresearch/xformers/blob/main/xformers/checkpoint.py) by @fmassa

End-to-end example:

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 8, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        inp = torch.randint(
            0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev
        )

        sace = SACEstimator()
        with sace(estimate_mode_type='operator-level-cost-model'):
            loss = model(inp).sum()
        loss.backward()
        sace.pwlf_sac_tradeoff_curve(n_segments=2, save_tradeoff_graphs=True)
        sace.display_modulewise_sac_stats(depth=4, print_tabular=True)
```

  Example AC Stats for one of the transformer layers:

![Screenshot 2024-10-11 at 10 09 13 PM](https://github.com/user-attachments/assets/1cf85564-4319-4732-bba1-89d505cda6ab)

Example AC Trade-off for one of the transformer layers:

![Screenshot 2024-10-11 at 10 09 58 PM](https://github.com/user-attachments/assets/5b2f343c-7e73-4c7d-bfea-3dcef2caa362)

Example AC Trade-Off graph for one of the transformer layers:

![Transformer layers 3](https://github.com/user-attachments/assets/490d4b37-a916-4298-a14c-f78ffecbbde2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135208
Approved by: https://github.com/weifengpy
2024-10-14 13:56:40 +00:00
0e4d42634e Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-14 10:33:43 +00:00
770c134998 Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a Python codegen file, similar to `caffe2/perfkernels/hp_emblookup_codegen.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-14 10:17:27 +00:00
cyy
c48fe89011 Make c10::string_view an alias of std::string_view (#130417)
In order to facilitate the migration from c10::string_view to std::string_view, the old c10::string_view was renamed to c10::string_view_ext.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-10-14 09:28:04 +00:00
41977a0531 Revert "Port Inductor dataclasses to be kw_only (#137768)"
This reverts commit 65d665bae5b82a54b819c0c4527e7ccf88d19427.

Reverted https://github.com/pytorch/pytorch/pull/137768 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seem to fail test_loop_ordering in trunk ([comment](https://github.com/pytorch/pytorch/pull/137768#issuecomment-2409203115))
2024-10-13 22:25:19 +00:00
08ce3aac62 Cache some ValueRanges (#137438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137438
Approved by: https://github.com/ezyang
2024-10-13 19:23:34 +00:00
b361cd01f1 profiler: Fix undefined reference to unwind_c in unwind_entry while LTO is enabled (#137862)
With LTO (Link Time Optimization) enabled in CFLAGS, some compilers will optimize away and strip the unwind_c function because they cannot resolve the reference correctly, breaking the build with an undefined reference in unwind_entry. Add an attribute to avoid this.

Fixes #121282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137862
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-13 19:04:58 +00:00
c09b567a91 Fixed error string assertion in test_invalid_devices (#137772)
The ROCm distribution returns a different error string for this operation, so the test fails this assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137772
Approved by: https://github.com/Skylion007
2024-10-13 18:10:07 +00:00
65d665bae5 Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-13 14:55:45 +00:00
cfc5d18aad [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-13 14:42:58 +00:00
b181652f3d [AOTI] Handle inplace output in ProxyExecutor (#137660)
Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660
Approved by: https://github.com/chenyang78
2024-10-13 14:42:58 +00:00
cyy
a90b920284 Install llvm18 packages for ASAN workflows (#137335)
Follows #128763
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137335
Approved by: https://github.com/ezyang
2024-10-13 13:49:38 +00:00
4a8e49389c Make Context to be Device-agnostic Step by Step (1/N) (#136519)
----

- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-13 12:38:02 +00:00
563e9f99c3 Revert "Add device agnostic API for accelerator hooks (#137480)"
This reverts commit 858c91c3d8d9a71c66d0357e51a4bd805f95599f.

Reverted https://github.com/pytorch/pytorch/pull/137480 on behalf of https://github.com/albanD due to break all builds on trunk ([comment](https://github.com/pytorch/pytorch/pull/137480#issuecomment-2408954802))
2024-10-13 12:12:37 +00:00
08576b254b Fix logging in socket.cpp (#137745)
The formatter should avoid throwing exceptions as much as possible.

Fixes https://github.com/pytorch/pytorch/pull/128673#discussion_r1796226656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137745
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2024-10-13 10:38:10 +00:00
fe8d66d9a6 Faster Faster BatchSampler (#137423)
Builds upon #76951.

Benchmarking code is the same as in #76950.

AMD Ryzen Threadripper PRO 3995WX:
```
  batch_size  drop_last      origin     new  speedup
------------  -----------  --------  ------  ---------
           4  True           0.94    0.5706  64.74%
           4  False          0.9745  0.9468  2.93%
           8  True           0.7423  0.3715  99.82%
           8  False          0.7974  0.5666  40.73%
          64  True           0.5394  0.2085  158.76%
          64  False          0.6083  0.2697  125.51%
         640  True           0.5448  0.1985  174.41%
         640  False          0.7085  0.2308  206.91%
        6400  True           0.5554  0.2028  173.88%
        6400  False          0.7711  0.2109  265.60%
       64000  True           0.556   0.2091  165.82%
       64000  False          0.7803  0.2078  275.58%
```

When `drop_last == True`, it uses `zip` to speed things up.
When `drop_last == False`, it uses `itertools` to speed things up.

`itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now.
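
A rough sketch of the two strategies described above (illustrative only, not the exact `BatchSampler` implementation):

```python
import itertools

def batches_drop_last(sampler, batch_size):
    # zip the same iterator with itself batch_size times; an incomplete tail is dropped
    it = iter(sampler)
    yield from zip(*[it] * batch_size)

def batches_keep_last(sampler, batch_size):
    # islice pulls up to batch_size items at a time, keeping a short final batch
    it = iter(sampler)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch
```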

Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423
Approved by: https://github.com/Skylion007
2024-10-13 09:36:03 +00:00
b3af359cba Log WorkNCCL exception string to C10dLogger (#137736)
Summary: In WorkNCCL::handleException, log to c10d logger with `strings["work_nccl_exception"]`.

Test Plan: Test run job to verify NCCL exception is logged.

Differential Revision: D62603322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137736
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-10-13 07:33:05 +00:00
858c91c3d8 Add device agnostic API for accelerator hooks (#137480)
Make `AcceleratorHooksInterface` consistent for multiple accelerators
- Add `showConfig` and `deviceSynchronize` method declaration in `AcceleratorHooksInterface`
- Remove unreachable lines of code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137480
Approved by: https://github.com/albanD, https://github.com/FFFrog
2024-10-13 07:19:32 +00:00
7642f6d047 [AMD] Unify cublaslt and hipblaslt path (#137604)
Differential Revision: D63967918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137604
Approved by: https://github.com/eqy
2024-10-13 07:11:12 +00:00
fa08e924ad Skip test export with fake tensor inputs on cuda devices for Intel GPU (#137847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137847
Approved by: https://github.com/etaf, https://github.com/jansel
2024-10-13 07:07:48 +00:00
e3df636580 Fix -Wsign-compare warning spam in Indexing.cu (#137842)
Detailed Descriptions:

Fix for warning spam like
```
warning: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘long int’ [-Wsign-compare]
```
![image](https://github.com/user-attachments/assets/7be3cfff-c33b-4a6e-b52d-04085e6e1bec)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137842
Approved by: https://github.com/ezyang
2024-10-13 07:03:12 +00:00
1d6932937e [dynamo] fix NamedTupleVariable for PyStructSequence (torch.return_types.*) support (#137776)
PyStructSequence is the C API equivalent for `collections.namedtuple` in Python. But they have different constructors:

```python
tuple = NamedTupleType(*args)
tuple = NamedTupleType._make(args)
tuple = StructSequenceType(args)
```
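
A hedged example with a real PyStructSequence type from `torch.return_types`; the direct construction below is my reading of the struct-sequence protocol quoted above:

```python
import torch

values, indices = torch.max(torch.randn(2, 3), dim=0)  # a torch.return_types.max

# Unlike a namedtuple, the struct sequence is constructed from a single sequence:
rebuilt = torch.return_types.max((values, indices))
assert torch.equal(rebuilt.values, values) and torch.equal(rebuilt.indices, indices)
```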

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137776
Approved by: https://github.com/jansel
2024-10-13 06:46:41 +00:00
3050f2e5dd [dynamo] Check nn modules parameters are not overwritten before taking tracing shortcut (#137824)
Fixes https://github.com/pytorch/pytorch/issues/136257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137824
Approved by: https://github.com/jansel
2024-10-13 05:04:28 +00:00
09e2a0d7bc fix PyTorch build with Address Sanitizer enabled (#137446)
**Problem**
Building PyTorch with Address Sanitizer (ASAN) enabled was failing due to a static assertion in KernelFunction_impl.h. The compiler was unable to evaluate FuncPtr::func_ptr() as a constant expression when ASAN was enabled, causing a build error.

```
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o
/usr/bin/ccache /usr/bin/g++-11 -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D_GLIBCXX_SANITIZE_STD_ALLOCATOR -D_GLIBCXX_SANITIZE_VECTOR -Dtorch_cpu_EXPORTS -I/home/abhishekk/stantize/venv/pytorch/build/aten/src -I/home/abhishekk/stantize/venv/pytorch/aten/src -I/home/abhishekk/stantize/venv/pytorch/build -I/home/abhishekk/stantize/venv/pytorch -I/home/abhishekk/stantize/venv/pytorch/cmake/../third_party/benchmark/include -I/home/abhishekk/stantize/venv/pytorch/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/build/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/nlohmann -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api/include -I/home/abhishekk/stantize/venv/pytorch/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/../aten/src -I/home/abhishekk/stantize/venv/pytorch/torch/csrc -I/home/abhishekk/stantize/venv/pytorch/third_party/miniz-2.1.0 -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/include -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/src -I/home/abhishekk/stantize/venv/pytorch/third_party/cpp-httplib -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/.. -I/home/abhishekk/stantize/venv/pytorch/third_party/FXdiv/include -I/home/abhishekk/stantize/venv/pytorch/c10/.. 
-I/home/abhishekk/stantize/venv/pytorch/third_party/pthreadpool/include -I/home/abhishekk/stantize/venv/pytorch/third_party/cpuinfo/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/home/abhishekk/stantize/venv/pytorch/third_party/NNPACK/include -I/home/abhishekk/stantize/venv/pytorch/third_party/FP16/include -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/build/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/abhishekk/stantize/venv/pytorch/third_party/fmt/include -I/home/abhishekk/stantize/venv/pytorch/third_party/flatbuffers/include -isystem /home/abhishekk/stantize/venv/pytorch/build/third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/abhishekk/stantize/venv/pytorch/third_party/protobuf/src -isystem /home/abhishekk/stantize/venv/pytorch/third_party/XNNPACK/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/eigen -isystem /home/abhishekk/stantize/venv/pytorch/INTERFACE -isystem /home/abhishekk/stantize/venv/pytorch/third_party/nlohmann/include -isystem /home/abhishekk/stantize/venv/pytorch/build/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include/openmpi -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_PYTORCH_QNNPACK -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_SVE_CPU_DEFINITION -DHAVE_SVE256_CPU_DEFINITION -g -fno-omit-frame-pointer -Og -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -c 
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp
In file included from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction.h:260,
                 from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:4,
                 from /home/abhishekk/stantize/venv/pytorch/torch/library.h:63,
                 from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:3:
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h: In instantiation of ‘static c10::KernelFunction c10::KernelFunction::makeFromUnboxedFunction(FuncPtr) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; bool AllowLegacyTypes = false]’:
/home/abhishekk/stantize/venv/pytorch/torch/library.h:133:59:   required from ‘torch::CppFunction::CppFunction(FuncPtr, std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t>) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t> = std::nullptr_t]’
/home/abhishekk/stantize/venv/pytorch/torch/library.h:691:17:   required from ‘torch::Library& torch::Library::impl(Name, Func&&, torch::_RegisterOrVerify) & [with Name = const char*; Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’
/home/abhishekk/stantize/venv/pytorch/torch/library.h:782:16:   required from ‘torch::Library& torch::Library::impl(torch::detail::SelectiveStr<true>, Func&&) & [with Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:87:9:   required from here
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:177:39: error: non-constant condition for static assertion
  177 |     static_assert(FuncPtr::func_ptr() != nullptr, "Kernel function cannot be nullptr");
      |                   ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
```

**Testing**

- Verified that PyTorch builds successfully with USE_ASAN=ON
- Ran PyTorch test suite to ensure no regressions were introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137446
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-10-13 03:31:54 +00:00
70bd58c35f Revert "Add support for add in tensorify_python_scalars fx pass (#137620)"
This reverts commit 0430e72e755d2c1953917ffb78f00c516eb4bbd5.

Reverted https://github.com/pytorch/pytorch/pull/137620 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk 0430e72e75 ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))
2024-10-13 02:05:37 +00:00
279052ab86 Revert "Add support for sub in tensorify_python_scalars fx pass (#137622)"
This reverts commit b7924610a0c20f72657548acef7743801189444a.

Reverted https://github.com/pytorch/pytorch/pull/137622 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk 0430e72e75 ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))
2024-10-13 02:05:37 +00:00
5fee1ee3f4 [inductor] Refactor generate_workspace_allocation (#137673)
And some other small changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137673
Approved by: https://github.com/Chillee
ghstack dependencies: #137754
2024-10-13 01:25:14 +00:00
5146e6a96d [inductor] Fix reduction_hint sum to single element (#137754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137754
Approved by: https://github.com/Chillee, https://github.com/malfet
2024-10-13 01:08:23 +00:00
b7924610a0 Add support for sub in tensorify_python_scalars fx pass (#137622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622
Approved by: https://github.com/ezyang
ghstack dependencies: #137620
2024-10-13 00:30:02 +00:00
bd63ec4f45 [ROCm] LoadHIP CMake cleanup (#137112)
Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313

While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined.

This PR does not fix any bugs or provide any new functionality. Strictly clean-up.

The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-13 00:06:41 +00:00
47c8aa8090 Refactor make device agnostic in accelerator hooks (#137558)
Make `AcceleratorHooksInterface` consistent for multiple accelerators
- Add `getDeviceFromPtr` method declaration in `AcceleratorHooksInterface`
- Fix clangtidy warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137558
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2024-10-12 18:13:54 +00:00
0430e72e75 Add support for add in tensorify_python_scalars fx pass (#137620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620
Approved by: https://github.com/ezyang
ghstack dependencies: #136674, #137588
2024-10-12 17:18:27 +00:00
e89fe0bd6e Updating cuda binary build to get cusparselt from PYPI (#137653)
Fixes #137374
Update 1: such a PR requires Meta to upload the PyPI package to download.pytorch.org.
See: ERROR: Could not find a version that satisfies the requirement nvidia-cusparselt-cu12==0.6.2; platform_system == "Linux" and platform_machine == "x86_64" (from torch) (from versions: none)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137653
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/atalman
2024-10-12 16:40:37 +00:00
ed55d356de [alt] fix unroll in successive unflatten (#137646)
We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.

Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, the values in nn_module_stack instead went from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)
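
An illustrative (hypothetical) sketch of the key change; the actual key strings and metadata stored in nn_module_stack may differ:

```python
# Two successive calls to the same submodule used to carry identical keys,
# so the call boundary was invisible; the call index now disambiguates them.
first_call = {"L__self___mod": ("mod", "SubMod")}
second_call = {"L__self___mod@1": ("mod", "SubMod")}  # call index appended to the key
```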

Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.

Test Plan: Like D64014936

Differential Revision: D64136277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
2024-10-12 15:53:52 +00:00
561f07fae7 Warn users of mkldnn device usage (#137553)
In https://github.com/pytorch/pytorch/issues/136831, a user creates a tensor on the mkldnn device, but mkldnn is no longer used as a device type; only the mkldnn layout is used.

We plan to remove mkldnn-device-related code in a future release. This PR warns users not to use the mkldnn device.
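
A short sketch of the distinction, assuming a dense CPU float tensor as the starting point: the mkldnn *layout* is reached via `to_mkldnn()`, not by passing a `device`:

```python
import torch

x = torch.randn(2, 2)
y = x.to_mkldnn()          # mkldnn is a layout of a CPU tensor
print(y.layout, y.device)  # torch._mkldnn cpu

# torch.randn(2, 2, device="mkldnn")  # discouraged: mkldnn is not a device type
```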

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137553
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-12 13:42:12 +00:00
0dbbcfa7ae [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947)
[Inductor UT] Generalize Newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_pattern_matcher.py`
reuse `test/inductor/test_snode_runtime.py`
reuse `test/inductor/test_unbacked_symints.py`
fix `test/inductor/test_triton_kernels.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-12 13:21:20 +00:00
030ba03681 Add meta functions for lerp, addcmul, and addcdiv. (#136909)
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was the root cause of
significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923
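
As a rough sketch of what a meta kernel does (illustrative only; the function below is hypothetical and not the registration added in this PR), it computes only the output's shape and dtype on the meta device without touching real data:

```
import torch

# Hypothetical sketch of shape/dtype propagation for an op like lerp:
# no real computation happens for tensors on the "meta" device.
def lerp_meta_sketch(start, end, weight):
    out_shape = torch.broadcast_shapes(start.shape, end.shape)
    out_dtype = torch.promote_types(start.dtype, end.dtype)
    return torch.empty(out_shape, dtype=out_dtype, device="meta")

a = torch.empty(4, 1, device="meta")
b = torch.empty(1, 5, device="meta")
print(lerp_meta_sketch(a, b, 0.5).shape)  # torch.Size([4, 5])
```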

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
2024-10-12 12:40:46 +00:00
6001b16597 Add entire _dynamo.config as a json for logging (#137216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137216
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-12 11:48:59 +00:00
a777dea3b3 Remove dtype check on meta device (#136774)
Summary:
# Latest Update

This diff is no longer needed because we did need the check to exist, to make meta behave the same as other devices, see D54526190.

---------------------------------

# Background

T176105639

| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| --- | --- | --- | --- | --- |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good | failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |

Currently we are in case A. Users need to add `use_fp32_embedding` in training to force the embedding bag dtype to be fp32. However, users actually want case B, using fp16 as the embedding bag weight. When deleting `use_fp32_embedding`, they would fail the [check](https://fburl.com/code/k3n3h031) that forces `weight dtype == per_sample_weights dtype` in meta_registration.

The check is actually not necessary, because the fbgemm backend does support case B. Additionally, later on in the `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to be in the same dtype (https://fburl.com/code/q0tho05h, weight is src, per_sample_weights is scale) for `is_fast_path_index_select`.

# This diff
Therefore, this diff removes the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in meta forward. With this, users are able to use fp16 as the emb bag dtype without needing to force per_sample_weights to the same dtype in meta forward (see Test Plan).

# Reference diffs to resolve this issue
Diff 1: D52591217
This passes the embedding bag dtype to feature_processor to make per_sample_weights the same dtype as the emb bag weight. However, `is_meta` also needs to be passed because of case C. fbgemm still does not support per_sample_weights = fp16 (see the above table). Therefore users are forced to make per_sample_weights fp16 only when it is on meta. The solution requires too many hacks.

Diff 2: D53232739
Basically doing the same thing as in diff 1 D52591217, except that the hack is added in the TorchRec library. This adds an if in EBC and PEA: when the emb bag weight is fp16, it forces per_sample_weight to fp16 too. However, it would then hit the fbgemm issue as well and has broken a bunch of prod models.

Test Plan:
# APS
The following command will run icvr_launcher which triggers ads_launcher and run forward in meta device:
```
buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false
```

Result:
 {F1461463993}

Reviewed By: ezyang

Differential Revision: D54175438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774
Approved by: https://github.com/ezyang
2024-10-12 05:45:21 +00:00
92cc319120 Fix masked tensor test_stack memory leak (#137815)
This test is currently failing in trunk when memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970.  When testing locally, calling `backward` on a masked tensor always causes a memory leak until I clean up the data and the mask manually.  This is probably related to this warning from masked tensor `UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()`, but I don't know much about the internal details here to go further.  So, let's just fix the test first.

### Testing

```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda
```

passes and doesn't warn about memory leak anymore.

The test itself came from https://github.com/pytorch/pytorch/pull/125262#issuecomment-2344068012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137815
Approved by: https://github.com/kit1980
2024-10-12 04:30:57 +00:00
c8609cf4b0 [inductor] Update Triton CPU pin (#137778)
This incorporates the fix in
https://github.com/triton-lang/triton/pull/4871.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137778
Approved by: https://github.com/Skylion007
2024-10-12 03:09:09 +00:00
d52b2cf92f [CUDA][SDPA] Fix TF32 handling and bump threshold for multiheadattention test (#137752)
For sm90, main issue was that `torch.testing.assert_close` bypasses the `tf32_on_and_off` tolerance switch decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137752
Approved by: https://github.com/ezyang
2024-10-12 03:05:21 +00:00
2db3f85894 Fixes NumPy 2 test failures in test_torch.py (#137740)
Related to #107302

The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2.
This PR fixes all the corresponding test failures in `test_torch.py`.

1. The dtype of the return value of `np.percentile` when passed a `torch.float32` tensor.
NumPy 1: Return value of `np.float64`.
NumPy 2: Return value of `np.float32`.
Solution: Enforce it with `.astype(np.float64)`.

2. The type of `np.gradient()` when returning multiple arrays.
NumPy1: A list of arrays.
NumPy2: A tuple of arrays.
Solution: Cast the tuple to a list.
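
A minimal sketch of the two adjustments (illustrative, not the exact test code):

```
import numpy as np
import torch

x = torch.rand(10)

# 1. Pin the percentile result to float64 so NumPy 1 and NumPy 2 agree.
p = np.percentile(x.numpy(), 50).astype(np.float64)

# 2. np.gradient returns a tuple under NumPy 2; normalize it to a list.
grads = list(np.gradient(np.arange(12.0).reshape(3, 4)))
```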
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740
Approved by: https://github.com/ezyang
2024-10-12 02:40:17 +00:00
eqy
6be53d52c5 [CUDA][SDPA] Bump tolerances for grad_query in mem_eff test (#137750)
(for sm80)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137750
Approved by: https://github.com/drisspg
2024-10-12 02:15:14 +00:00
67883e70c0 change GPT2ForSequenceClassification inference accuracy tolerance (#136749)
Fixes https://github.com/pytorch/pytorch/issues/123503.

https://github.com/pytorch/pytorch/pull/121866 makes GPT2ForSequenceClassification hit SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with single-thread BF16 inference. This PR increases the model tolerance from 4e-3 to 5e-3 to make the check pass. Note that the issue is due to some small implementation diffs. For example, the sdpa math backend scales q, k before matmul for stability; the flash attention backend has more diffs as a new algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136749
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-12 01:12:28 +00:00
fba2c0a23a Fix comment in ProcessGroupGloo (#137746)
Summary: Algorithm caching was removed in 2018 D13111781

Test Plan: CI

Differential Revision: D64214527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137746
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-10-12 01:04:41 +00:00
69bcf1035e Updates reference to _runner-determinator.yml workflow, from current version to main version. (#137791)
Updates all references to runner determinator workflow (`_runner-determinator.yml`) from current cloned version to main version.

This enables the team to push updates to this workflow, like fixing bugs or pushing improvements, and have them immediately reflected on all open PRs. This avoids potentially breaking situations, enabling fast iteration and simple recovery in case of bugs.

From:

```
jobs:
  get-label-type:
    uses: ./.github/workflows/_runner-determinator.yml
```

To:

```
jobs:
  get-label-type:
    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro
2024-10-12 00:18:50 +00:00
e269a5cb09 [TCPStore] Throw value error if passing world_size=0 to TCPStore (#137792)
This fixes https://github.com/pytorch/pytorch/issues/137577.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137792
Approved by: https://github.com/fegin, https://github.com/H-Huang
ghstack dependencies: #137713, #137721
2024-10-11 23:42:57 +00:00
25ac5652d0 [Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)
Follows #124485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137328
Approved by: https://github.com/eqy
2024-10-11 23:23:57 +00:00
8486d3df69 [Profiler] Hide ProfilerStep Alignment behind Experimental Config (#137668)
Summary: Aligning the ProfilerStep# annotation can be useful for visual purposes, but it causes downstream tools like HTA to misreport how long each step took. For this reason, let's give users the option to turn on this alignment manually, and turn it off by default.

Test Plan:
Alignment off:

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_11_48.2543945.pt.trace.json.gz&bucket=gpu_traces

Alignment on:

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_08_27.2518391.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D64146115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137668
Approved by: https://github.com/aaronenyeshi
2024-10-11 22:57:05 +00:00
0121d64aa9 Revert "[AOTI] Handle inplace output in ProxyExecutor (#137660)"
This reverts commit 573101aac3b1addc0a0b945ae09fe9be9034d3a9.

Reverted https://github.com/pytorch/pytorch/pull/137660 on behalf of https://github.com/desertfire due to Fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/137660#issuecomment-2408213485))
2024-10-11 22:54:39 +00:00
c58e5c4efa Revert "[AOTI] Turn on the ABI-compatible mode as default (#136534)"
This reverts commit b0da076f0cd5957c7fe55a58876f3b74babfc1b7.

Reverted https://github.com/pytorch/pytorch/pull/136534 on behalf of https://github.com/desertfire due to The dependent PR https://github.com/pytorch/pytorch/pull/137660 fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/136534#issuecomment-2408211238))
2024-10-11 22:50:58 +00:00
e3173d8725 [pipelining] Shape Inference (#136912)
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes which is difficult and error prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates the existing inference helper in the PipelineStage constructor for several reasons: it's problematic to have to reason about the stage submod being on the right device for shape inference

The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously

Currently, does not add a barrier after shape-inference, which essentially pipelines shape inference with the subsequent schedule action for that stage.  If this complicates debugging, we could add in a barrier (it comes at a cost, but only during the first step).

Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-10-11 22:49:00 +00:00
432c3fe5af Default to use training IR (#137804)
Summary: Since capture_pre_autograd_graph is deprecated and will be deleted soon, we default this option to true.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D64254236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137804
Approved by: https://github.com/tugsbayasgalan
2024-10-11 22:34:28 +00:00
c254901bdb Have Triton custom extension test use privateuseone device (#137611)
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.

This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.

Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
2024-10-11 21:27:29 +00:00
19bbbef79d cublaslt autotuning support for TunableOp (#133896)
Adds support for cublaslt autotuning to TunableOp.

Todo:
- [x] Add and test `ScaledGemmTunableOp`
- [x] Benchmarking numbers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133896
Approved by: https://github.com/eqy, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-10-11 21:16:36 +00:00
1358969fa1 Revert "BundledAutotuneCache (#134959)"
This reverts commit 709021143d9c9aa90df578a2f5abb93a91a4852a.

Reverted https://github.com/pytorch/pytorch/pull/134959 on behalf of https://github.com/albanD due to The newly added test fails on rocm CI ([comment](https://github.com/pytorch/pytorch/pull/134959#issuecomment-2408091754))
2024-10-11 20:43:56 +00:00
74e871355b Add hooks to Scheduler nodes for generating device-specific debug strings (#135015)
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output as opposed to it being closely coupled to the debug string codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
2024-10-11 20:30:49 +00:00
8543000c27 Search through config changes in compiler bisector (#137346)
Follow up to https://github.com/pytorch/pytorch/pull/131936.  In the original bisector you'd have to test inline if we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root-causing issues. I've added `emulate_precision_casts` and aot_eager_decomp_partition cse as initial ones.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137346
Approved by: https://github.com/zou3519
2024-10-11 20:24:54 +00:00
513563eb09 Fix stack named "queue" in Util::ComputePostOrder (#130526)
This function computes a topological sort using a non-recursive implementation of DFS. Upon first reading, I thought it was using Kahn’s algorithm because it uses a variable called `queue`, but upon closer reading, I noticed this variable is actually used as a stack.

This pull request improves readability by renaming the stack and changing it from `std::vector` to `std::stack`.
Note: this also changes the backing store from an `std::vector` to an `std::deque`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130526
Approved by: https://github.com/alanwaketan, https://github.com/malfet
2024-10-11 20:21:07 +00:00
d0628a7e39 [ONNX] Remove deprecated export_to_pretty_string (#137790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137789
2024-10-11 20:10:04 +00:00
5fca2fd365 Try unify training and inference (#136888)
Previously inference -> inference IR was going through a separate flow from train -> inference decomposition. This diff unifies them so that we always retrace when decomposing. Joint IR decomp is still going through the old flow (inference -> inference) but seems ok for now since it is still in an experimental stage.

Differential Revision: [D63062521](https://our.internmc.facebook.com/intern/diff/D63062521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136888
Approved by: https://github.com/avikchaudhuri
2024-10-11 20:09:58 +00:00
3e0b83ad1f [ONNX] Remove ExportTypes (#137789)
Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789
Approved by: https://github.com/titaiwangms
2024-10-11 19:29:52 +00:00
460358a20f Run lint-autoformat only on PRs to main (#137802)
This is mostly to prevent showing up on ghstack PRs, with which code suggestions are not compatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137802
Approved by: https://github.com/huydhn
2024-10-11 19:25:34 +00:00
2cb983ab97 [CI] Adds support for selecting experiments for workflows on runner determinator (#137614)
adds a `default` tag to experiment configurations, allowing some experiments to be excluded by default from the random draw:

```
        experiments:
            lf:
                rollout_perc: 25
            otherExp:
                rollout_perc: 25
                default: false
        ---
```

and includes the configuration to filter what experiments are of interest for a particular workflow (comma separated):

```
  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"
```

The end goal is to enable us to run multiple experiments that are independent from one another. For example, while we still run the LF infra experiment, we want to migrate other runners leveraging the current solution. An immediate use case is the A100 instances, which we want to migrate to AWS.

During the migration period, those new instances will be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first, confusing one.

```
jobs:
  get-build-label-type:
    name: get-build-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...

  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"

  linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
    name: cuda12.1-py3.10-gcc9-sm80
    uses: ./.github/workflows/_linux-build.yml
    needs:
      - get-build-label-type
      - get-test-label-type
    with:
      runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}"
      ...
      test-matrix: |
        { include: [
          { config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
          ...
        ]}
      ...
```

```
experiments:
    lf:
        rollout_perc: 50
    awsa100:
        rollout_perc: 50
        default: false
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614
Approved by: https://github.com/malfet
2024-10-11 19:20:02 +00:00
709021143d BundledAutotuneCache (#134959)
Add a cache to combine individual autotune caches into a single cached bundle.  We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can retrieved later.

Various related configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version

Testing:

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

 - on a cold run we got 0 hits and 40 misses. On a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D60677499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134959
Approved by: https://github.com/oulgen
2024-10-11 19:12:41 +00:00
b82000c1b3 Removed _compile workaround for create_block_mask (#137477)
I also put in a change so that `create_block_mask` properly handles non-multiples of BLOCK_SIZE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137477
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng
2024-10-11 19:04:23 +00:00
2dcd69da50 [inductor] Delete dead code and lints (#137753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137753
Approved by: https://github.com/Chillee
2024-10-11 18:55:08 +00:00
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
04adb74d08 [inductor][cond] Remove redundant prefix (#137718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137718
Approved by: https://github.com/eellison
ghstack dependencies: #137200
2024-10-11 18:13:18 +00:00
cd02c85ba4 [inductor][subgraph][python-wrapper] Lift subgraph code into functions (#137200)
Earlier the subgraphs were getting inlined into the output code. This PR lifts the subgraphs into a function, and then we just call the function in the output code.

This is the output code for test `test_cond_reintepret_view_inputs_outputs`

Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/

A relevant snippet from the above paste is

~~~

def false_graph_0(args):
    false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
    args.clear()
    s0 = s0
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
        triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
        del false_graph_0_arg0_1
        del false_graph_0_arg1_1
    return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    assert_size_stride(arg1_1, (s0, 20), (20, 1))
    assert_size_stride(arg2_1, (s0, 20), (20, 1))
    assert_size_stride(arg3_1, (), ())
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = [None] * 2
        buf0 = [None] * 2
        if arg3_1.item():
            # subgraph: true_graph_0
            true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
            buf0[0] = true_graph_0_buf0
            buf0[1] = true_graph_0_buf1
        else:
            # subgraph: false_graph_0
            false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
            buf0[0] = false_graph_0_buf0
            buf0[1] = false_graph_0_buf1
        del arg1_1
        del arg2_1
        del arg3_1
        buf1 = buf0[0]
        buf2 = buf0[1]
        del buf0
    return (buf1, buf2, )

~~~

The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.

Note that this PR only works for the python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate the global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).

This work will unblock hierarchical compilation (or cold start compile time work).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
2024-10-11 17:57:10 +00:00
68272ab596 Extend cuda_flip to unsigned types (#137781)
Using AT_DISPATCH_V2

Test plan: `python3 -c "import torch;print(torch.randint(0, 100, (4, 4),  dtype=torch.uint16, device='cuda').flip(0))"`
Fixes https://github.com/pytorch/pytorch/issues/137770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137781
Approved by: https://github.com/Skylion007
2024-10-11 17:02:53 +00:00
4fa46d3bda TunableOp: Performance Improvement (#135371)
This PR reduces the overhead on the CPU side by eliminating the use of c10::str in creating signatures. Instead, we use the fmt library. TunableOp overhead on the CPU is reduced by around 40%. The improvement is most noticeable on small GEMMs. This PR does not contain any bug fixes or new features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135371
Approved by: https://github.com/jeffdaily
2024-10-11 16:52:40 +00:00
da578495ca [ROCm] enable gfx110x for hipblaslt (#137317)
Fixes #136347.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137317
Approved by: https://github.com/Skylion007, https://github.com/jithunnair-amd

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-11 16:51:31 +00:00
41ccfc8752 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Differential Revision: [D64184625](https://our.internmc.facebook.com/intern/diff/D64184625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-11 16:50:25 +00:00
a06d49a9f9 bump up add_loop_inductor_gpu expected instruction count. (#137672)
The diff https://github.com/pytorch/pytorch/pull/137117/files increased the instruction count for add_loop_inductor_gpu,
but not enough to fail in that diff; now it's kind of a flaky test.

it failed on recent merge:
<img width="1351" alt="Screenshot 2024-10-09 at 5 25 57 PM" src="https://github.com/user-attachments/assets/27178f76-c08e-4d13-9ac4-4cd70f146611">

and here is the history
<img width="1047" alt="Screenshot 2024-10-09 at 5 26 07 PM" src="https://github.com/user-attachments/assets/bd563e34-6f7f-461a-ae54-8a616be9bf09">
<img width="777" alt="Screenshot 2024-10-09 at 5 30 19 PM" src="https://github.com/user-attachments/assets/d0a1ca81-2bdb-4cf6-8ac8-ba5971d447bf">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137672
Approved by: https://github.com/masnesral
2024-10-11 16:46:38 +00:00
d41558f8d7 [BE][Ez]: Better error message for CUDNN attention attn_bias (#137702)
Follow up to  #136885 . Fixes edge case on error condition (should be early exit so that expand doesn't every run into any trouble with weird cases (attn_bias 0, 1, > 5 dim).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137702
Approved by: https://github.com/eqy
2024-10-11 16:44:46 +00:00
5835b1af10 [FSDP2] Gated dynamo import for torch deploy (#137203)
Differential Revision: [D63777335](https://our.internmc.facebook.com/intern/diff/D63777335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137203
Approved by: https://github.com/wz337
2024-10-11 16:38:19 +00:00
bdb42e7c94 [PGNCCL] Added some missing spaces in barrier msg (#137721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137721
Approved by: https://github.com/kwen2501
ghstack dependencies: #137713
2024-10-11 15:17:25 +00:00
39c5048549 [DeviceMesh] Fixed from_group when passing tensor mesh (#137713)
This fixes https://github.com/pytorch/pytorch/issues/137676. (sorry for messing this up in the original PR 😓 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137713
Approved by: https://github.com/wz337
2024-10-11 14:53:51 +00:00
e30c55ee52 Update maintainers for inductor and x86 CPU (#136839)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136839
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
2024-10-11 07:24:07 +00:00
1c71de5b2c [ScaleMM] Add a shape dependent max_swizzle size (#137681)
# Summary

I started to explore the performance of _scaled_mm against a triton-based persistent TMA kernel for RowWise scaling.
There are more details here: https://github.com/drisspg/transformer_nuggets/pull/36

It clearly showed that there was some room for improvement on larger problem sizes compared to triton's performance. Note that the triton kernel only has a 128x128x128 tile shape, whereas scaled_mm has a 64x128x128 tile shape, which we use for smaller problem sizes; this may explain some of the perf delta at smaller shapes.

This led to exploring whether we can improve our triton codegen lowering for _scaled_mm (I think we should still do this: https://github.com/pytorch/pytorch/pull/137517).

In the meantime @Chillee suggested I make sure swizzling is set for the large matmul shapes.

This PR makes sure that we increase the max_swizzle_size for the large matmuls.

## Performance
Note: red means the triton-based TMA kernel beats _scaled_mm; blue means _scaled_mm is faster.

On nightly w/ Triton at (2ef33c6c4c3)
![swizzle_tst_8_full_nightly_heatmaps](https://github.com/user-attachments/assets/e92af19b-4e79-4126-b9d0-da039da5363b)

You can see that as M,K,N increase there is a clear win for the Triton Persistent TMA.

After this PR:

![swizzle_tst_8_full_heatmaps](https://github.com/user-attachments/assets/472068b3-45c2-43f8-84d3-b116da7898d5)

For example, w/ this change (power-limited GPU):

M=16384  K=16384  N=16384
TFlops before: `985.49`
TFlops after: `1304.69`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137681
Approved by: https://github.com/eqy
2024-10-11 06:44:31 +00:00
4e309899c7 [Quant] Check stride > 0 for QConv and QConvTranspose (#136739)
Fixes #136722
Fixes #136718

By default, it goes to onednn, so this PR adds a check to ensure stride > 0. Now the program exits with an error message if stride is 0.
FBGEMM and QNNPACK can create modules with stride=0 without error, but the program crashes when calling forward.
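
A minimal repro sketch of the intended behavior (the module path and error type below are assumptions, and where exactly the error surfaces may depend on the backend):

```
import torch
from torch.ao.nn.quantized import Conv2d

# stride=0 is never a valid convolution stride; with this change the onednn
# path should reject it with a clear error instead of crashing in forward().
try:
    m = Conv2d(3, 3, kernel_size=1, stride=0)
    x = torch.quantize_per_tensor(torch.rand(1, 3, 8, 8), 0.1, 0, torch.quint8)
    m(x)
except Exception as e:  # assumed to be a RuntimeError raised by the new check
    print("rejected:", e)
```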

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136739
Approved by: https://github.com/jgong5
2024-10-11 05:50:37 +00:00
fe148024fe [c10d][experimental] Add _abort_process_group (#132291)
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797

This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such).
One can find an example from: https://github.com/NVIDIA/nccl/issues/1013

## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.

## What's new in `_abort_process_group`?
It adds support for a "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in the multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). The `group abort` semantic was added in NCCL 2.22.

## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of a "global view" by each PG's individual watchdog. A semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs).

In any case, it may not be a bad idea to experiment with the "group abort" feature via a manual API first and then extend it to the automatic mode (watchdog).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
2024-10-11 05:04:17 +00:00
bc232e3c08 Fix custom op bug of clearing dir (#137655)
Previously, when we deleted a custom op on exiting the context manager, we weren't clearing the dir field of the op namespace. As a result, it was polluting other tests.

Differential Revision: [D64141465](https://our.internmc.facebook.com/intern/diff/D64141465/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137655
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-10-11 04:32:40 +00:00
ee713f80ed Enable channels_last format for FSDP (#137382)
Enable FSDP to deal with channels_last memory formatted tensors. Preserving channels_last memory format makes FSDP compatible with the best kernels CUDNN offers.

Summary of changes:
1) Store strides information along with shapes
2) Replace calls to flatten() with as_strided(size=(param.numel(),), stride=(1,)) for flattening
3) Replace calls to view() with as_strided with the stored sizes and strides for unflattening
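
A minimal standalone sketch of the flatten/unflatten swap in items 2 and 3 above (illustrative tensors, not the actual FSDP internals):

```
import torch

p = torch.empty(2, 3, 4, 5).to(memory_format=torch.channels_last)
orig_size, orig_stride = p.size(), p.stride()

# flatten as a strided view instead of flatten(), keeping the layout recoverable
flat = p.as_strided(size=(p.numel(),), stride=(1,))

# unflatten later using the stored sizes and strides
restored = flat.as_strided(size=orig_size, stride=orig_stride)
print(restored.is_contiguous(memory_format=torch.channels_last))  # True
```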

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
2024-10-11 03:47:16 +00:00
8ee361ed13 fix test_retrace_pre_autograd (#137733)
Test Plan: fixed

Differential Revision: D64200918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137733
Approved by: https://github.com/pianpwk, https://github.com/tugsbayasgalan
2024-10-11 03:46:22 +00:00
8321eec009 [Inductor UT] Generalize device bias code in test_triton_kernels.py (#137585)
[Inductor UT] Generalize device bias code in test_triton_kernels.py introduced by #137020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137585
Approved by: https://github.com/eellison, https://github.com/jansel
2024-10-11 02:00:01 +00:00
8262f6d271 fix test_lazy_module_kwargs (#137705)
Test Plan: fixed

Differential Revision: D64185644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137705
Approved by: https://github.com/tugsbayasgalan
2024-10-11 01:53:10 +00:00
9d4cb0d3eb Fix param and buffer mapping for state_dict when there are state_dict hooks (#137609)
Resolve #137540

Summary:

We might get different state_dict and named_parameters result when the module has registered custom state_dict_hooks.
For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn-off the hooks when saving model's state_dict().
Previously, ExportedProgram uses nn.Module's state_dict() method to populate its own state_dict, but it doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically.

nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.

One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". At runtime, we have mod.layers[3].layer.sa_norm.scale.

But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.

Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly.

In internal Ads and RecSys model deployment, weight swapping is where they have one model that is currently deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it was previously a requirement that the FQNs match between the exported and the eager model’s state dict.

The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs match between the exported program's state dict and the state_dict obtained from the `_disabled_load_state_dict_hooks(M)` context manager. One benefit of having this new API is that we are now in full control within export of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so it's BC.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_for_training_with_state_dict_hooks
```

Differential Revision: D64080561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609
Approved by: https://github.com/angelayi, https://github.com/pianpwk
2024-10-11 01:33:50 +00:00
a919742149 c10::optional -> std::optional in PyTorch (#137333)
Test Plan: Sandcastle

Differential Revision: D63876535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137333
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-11 00:16:10 +00:00
4fb1fd8a51 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit b6a64dce072240c0b06d2fb03ac81b3ed1b73d92.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
b55ff476bd Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit cdd8fa98c77b052085cca65dd54769ae18b72104.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
b0da076f0c [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-10 23:44:57 +00:00
ad38bad766 [MPS] Add tri[lu]_indices (#137648)
Requested in https://github.com/pytorch/pytorch/issues/77764#issuecomment-2402365980
Copy-n-paste kernel implementation from 13cf8360d8/aten/src/ATen/native/cuda/TensorFactories.cu (L92)

though use `float` instead of `double` for square root computation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137648
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601, #137647
2024-10-10 23:41:06 +00:00
573101aac3 [AOTI] Handle inplace output in ProxyExecutor (#137660)
Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660
Approved by: https://github.com/chenyang78
2024-10-10 23:12:46 +00:00
c37bb492da [ONNX] Create an optimize method in ONNXProgram (#137667)
Move optimization from the export call to the `optimize()` method in ONNXProgram.

Users can call `optimize()` before calling `save()` to save the model. Right now if users set `optimize=True` in `torch.onnx.export` it will have the same effect as calling `optimize()`, but in the future we can evolve the method to be more flexible (e.g. target aware etc.)

Example

```python
onnx_program = torch.onnx.export(..., dynamo=True)
onnx_program.optimize()
onnx_program.save("model.onnx")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137667
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137666
2024-10-10 22:44:19 +00:00
e75984cd31 [ONNX] Use torch_2_6 apis from onnxscript (#137666)
Create an `optimize=False` option in `torch.onnx.export` for model optimization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137666
Approved by: https://github.com/titaiwangms
2024-10-10 22:23:15 +00:00
93bbc8abcc [dynamo, 3.13] use 3.13 multiline traceback in get_instruction_source_311 (#137617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137617
Approved by: https://github.com/jansel
2024-10-10 20:19:27 +00:00
4551a1ee79 [dynamo, 3.13] merge 3.13 FORMAT_* and <=3.12 FORMAT_VALUE (#137656)
This was causing some 3.13 failures locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137656
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #137652
2024-10-10 19:53:42 +00:00
6b2c3508f8 [dynamo, 3.13] fix typo in remove_fused_load_store (#137652)
Whoops!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137652
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-10-10 19:53:42 +00:00
9c12198137 [PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel, Try #2 (#137377)
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. First attempt was https://github.com/pytorch/pytorch/pull/136331 .

Differential Revision: [D63923166](https://our.internmc.facebook.com/intern/diff/D63923166/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137377
Approved by: https://github.com/malfet
2024-10-10 19:44:22 +00:00
080f02ac7a [dynamo] do not raise an unimplemented error with boolean masking setitem (#134902)
Cudagraph breaks on boolean masking setitem; however, the code runs fine. There is no need to raise an unimplemented error here, since it already warns that it's an incompatible op.

Fixes #134241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134902
Approved by: https://github.com/jansel, https://github.com/henrylhtsang
2024-10-10 19:11:40 +00:00
079f909263 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit be0b75256a7e516217b059ef273901b95c022fe7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:17 +00:00
33e5921e6b Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit 72ad1b8c6c7c364c1974b82a914876dcdf73af44.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:16 +00:00
881a18f25f Set Cuda context in inductor and dont initialize wrong cuda device in fake_tensor (#137603)
Previously we would construct tensors with the "cuda" device, which defaults to device 0 if no cuda context is set. Fix for https://github.com/pytorch/pytorch/issues/124854
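
A small illustration of the ambiguity (assumes a machine with at least two GPUs):

```
import torch

# A bare "cuda" device binds to the *current* device, which is cuda:0 unless
# a context/device has already been set elsewhere.
torch.cuda.set_device(1)
a = torch.ones(1, device="cuda")    # lands on cuda:1
b = torch.ones(1, device="cuda:0")  # explicit index is unambiguous
print(a.device, b.device)
```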

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137603
Approved by: https://github.com/jansel
2024-10-10 18:25:22 +00:00
dd7c2899bd [dynamo] Properly prune dead cell local variables (#136891)
This patch updates the `prune_dead_locals` logic to do slightly more aggressive pruning for cell local variables in the absence of side effects, e.g., a cell variable can be pruned when its user function(s) will never be used again.

See added tests for examples; note that a few tests in `test/dynamo/test_higher_order_ops.py` also got updated because we are no longer returning the unnecessary graph output.

Fixes #127350, #124653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136891
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
2024-10-10 18:21:24 +00:00
bcfdb72547 Fix dtype test for NumPy 2 (#137532)
Related to #107302

The following test fails with NumPy 2.

```
_________ TestNumPyInteropCPU.test_numpy_array_interface_cpu __________
Traceback (most recent call last):
  File "/usr/local/google/home/haifengj/git/pytorch_np2/test/test_numpy_interop.py", line 415, in test_numpy_array_interface
    wrapped_x = np.array([1, -2, 3, -4], dtype=dtype)
OverflowError: Python integer -2 out of bounds for uint8

To execute this test, run the following from the base repo dir:
    python test/test_numpy_interop.py TestNumPyInteropCPU.test_numpy_array_interface_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

According to the official warning from NumPy 1, assigning a negative value to a `uint8` is deprecated.
The recommended way is to `np.array([1, -2, 3, -4]).astype(np.uint8)`
See the following for details.
```
>>> np.array([1, -2, 3, -4], dtype=np.uint8)
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -2 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -4 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
array([  1, 254,   3, 252], dtype=uint8)
```

This PR fixes the test failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137532
Approved by: https://github.com/soulitzer
2024-10-10 18:12:25 +00:00
5e73f2d7c0 [PT2][Dynamo][Optimus] Add batch detach, clamp and nan_to_num in pre grad (#137415)
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=4 OC_CAUSE=1 buck2 test '@fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_math_op_fusion
```

Buck UI: https://www.internalfb.com/buck2/185799e1-6ea8-4bd1-b2e1-0c1a8dd92f89
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275044114335
Network: Up: 14KiB  Down: 287B  (reSessionID-d24cee56-2a22-4a90-b4c6-1d0c3ab256c1)
Jobs completed: 8. Time elapsed: 48.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt scripts/shuaiyang:test -- --optimus --flow_id 648108097 2>&1 | tee ~/local_run_shuai_interformer_cmf.txt
```

Counter({'pattern_matcher_nodes': 6626, 'pattern_matcher_count': 6396, 'extern_calls': 5340, 'benchmarking.TritonBenchmarker.benchmark_gpu': 2710, 'normalization_pass': 44, 'fxgraph_cache_miss': 37, 'scmerge_split_removed': 16, 'scmerge_cat_removed': 16, 'unbind_stack_pass': 16, 'batch_aten_mul': 15, 'batch_linear_post_grad': 12, 'batch_linear': 5, 'batch_detach': 4, 'batch_nan_to_num': 4, 'batch_clamp': 4, 'batch_aten_add': 4, 'batch_layernorm': 2, 'scmerge_cat_added': 2, 'batch_sigmoid': 1, 'scmerge_split_sections_removed': 1, 'unbind_stack_to_slices_pass': 1, 'benchmarking.TritonBenchmarker.triton_do_bench': 1, 'scmerge_split_added': 1, 'fxgraph_cache_hit': 1, 'batch_aten_sub': 1})

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-10-06-20-53-01/trace.json.gz&bucket=gpu_traces

# e2e

baseline:
f650336422

proposal:

f650336607

### QPS and NE results

 {F1914975940}
{F1914975938}
{F1914975939}
{F1914975945}

> 0.7% QPS gain with NE neutral

### trace analysis

Before
 {F1914990600}

After

{F1914990015}

We reduced the green part in the trace, which was introduced by small nan_to_num kernels.

Differential Revision: D63962711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137415
Approved by: https://github.com/Yuzhen11
2024-10-10 18:11:08 +00:00
cyy
94e12f97dc [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404)
Follows #137072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404
Approved by: https://github.com/Skylion007
2024-10-10 18:05:34 +00:00
20815c7cb9 Intel GPU: mode: add XPU to supported devices list (#137575)
The kernel for the `mode` op is being ported to https://github.com/intel/torch-xpu-ops/pull/770; this requires adding XPU to the supported device types.

Additional context: https://github.com/intel/torch-xpu-ops/issues/327

@fengyuan14 @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137575
Approved by: https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-10 17:44:40 +00:00
cdd8fa98c7 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, each event is destroyed with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
2024-10-10 17:16:34 +00:00
9690cacd61 [aotinductor] Add helper fn to atomically apply size_hint to an expr w/ unbacked symints (#137537)
### Context
Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.

In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))

buf812 = generate_example_value(
    (64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```

The fix is to apply the fallback on each symbol.
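
A sympy sketch of the difference (8192 here is just the fallback hint from the example above, not a constant from the code):

```
import sympy

u0 = sympy.Symbol("u0")
stride_expr = u0 * 128
hint = 8192

# before: the fallback replaced the whole expression -> stride hint 8192
wrong = hint
# after: the fallback is substituted per unbacked symbol -> 8192 * 128
right = stride_expr.subs(u0, hint)
print(wrong, right)  # 8192 1048576
```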

### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda

========= Invalid __global__ write of size 2 bytes
```

Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
2024-10-10 17:11:24 +00:00
b6a64dce07 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere
2024-10-10 17:11:21 +00:00
034af88c2d Add a microbechmark for cache read path (#137607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607
Approved by: https://github.com/jamesjwu
2024-10-10 16:36:18 +00:00
dae60075e0 [BE][MPS] Use Tensor->TensorBase in OperationUtils.h (#137647)
As, for the most part, those helper methods need access to only base-class methods.
Also replace spurious `at::` namespace prefixes, i.e. `at::Tensor` -> `Tensor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137647
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601
2024-10-10 16:11:17 +00:00
bcf15d1bb4 [AOTI] Add error check for parsing error string from error code (#137626)
Currently, there are compilation warnings as shown below, which are resolved by this fix:

```
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp: In function ‘ihipModuleSymbol_t* loadKernel(std::string, const string&, uint32_t, const std::optional<std::__cxx11::basic_string<char> >&)’:
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:482:25: warning: ignoring returned value of type ‘hipError_t’, declared with attribute nodiscard [-Wunused-result]
  482 |     hipDrvGetErrorString(code, &msg);                  \
      |     ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:519:5: note: in expansion of macro ‘CUDA_DRIVER_CHECK’
  519 |     CUDA_DRIVER_CHECK(hipModuleLoad(&mod, filePath.c_str()));
      |     ^~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
                 from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:2369:12: note: in call to ‘hipError_t hipDrvGetErrorString(hipError_t, const char**)’, declared here
 2369 | hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
      |            ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
                 from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:399:3: note: ‘hipError_t’ declared here
  399 | } hipError_t;
      |   ^~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137626
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-10-10 15:14:39 +00:00
575f260229 Extend vectorization with SVE(ARM) with Torch Compile (Inductor) (#134672)
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571

* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths

I've done some comparisons of the existing NEON implementation against the SVE-vectorization-enabled route for `torch.compile`.
Test results are for 8 cores on an ARM Neoverse_V1.

<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">

It's worth mentioning that for the standalone `SiLU` op there's a `~1.8x` speedup with `torch.compile`.
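This is not the benchmark used for the numbers above, just a rough sketch of how one could compare eager vs. compiled SiLU on CPU:

```python
import torch
from torch.utils.benchmark import Timer

x = torch.randn(4096, 4096)            # CPU tensor; Inductor picks the vector ISA (NEON/SVE)
silu = torch.nn.SiLU()
compiled_silu = torch.compile(silu)
compiled_silu(x)                        # warm up / trigger compilation

for name, fn in [("eager", silu), ("compiled", compiled_silu)]:
    t = Timer(stmt="fn(x)", globals={"fn": fn, "x": x}).blocked_autorange()
    print(name, t.median)
```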

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-10 13:20:40 +00:00
479bd1f300 Hardlock frequent periodic jobs to Meta runners (#137616)
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners; however, we saw a sudden large spike in cost once that happened last week, which would have caused us to overuse our available AWS credits. This change hardlocks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-10-10 12:32:16 +00:00
f69bf005f7 Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)"
This reverts commit 4304c68a4c4d742a3ec5266b81f64a85922509c9.

Reverted https://github.com/pytorch/pytorch/pull/137097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it seems to increase the compilation time a lot causing some jobs to timeout ([comment](https://github.com/pytorch/pytorch/pull/137097#issuecomment-2404573266))
2024-10-10 09:29:05 +00:00
eea1f79a1d [AMD] use rccl.h instead of rccl/rccl.h (#135472)
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the layout of the ROCm RPM suite (the header is in include/rccl/rccl.h); however, in the source tree the header is just src/rccl.h. Using rccl/rccl.h makes us find the RPM's header but not the source tree's header.

Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True  --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true  //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom   data_loader.dataset.table_ds=[2024-09-04]   data_loader.dataset.batch_size=512  max_ind_range=10

Without this diff, it shows NCCL version 2.18.

Differential Revision: D62371434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
2024-10-10 08:55:57 +00:00
eaab5cf0f9 Fix torch.compile correctness bug on aarch64+sve due to gcc bug (#137606)
Some unit tests related to argmin_vec/argmax_vec were failing due to a bug in GCC affecting versions <= 12 on aarch64 platforms with SVE:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

Fixes #137597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137606
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-10 08:44:53 +00:00
365722f606 fix test_constant_output (#137547)
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.

Test Plan: fixed test

Differential Revision: D64081786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
2024-10-10 07:48:15 +00:00
4e8997744c [inductor] Enable coordinate descent tuning with max-autotune (#136867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136867
Approved by: https://github.com/Chillee
2024-10-10 07:29:52 +00:00
383eba5229 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.
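A small sketch of the user-visible behavior, assuming a CUDA build:

```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.rand(1_000_000, device="cuda")
# With deterministic algorithms enabled, CUDA cumsum goes through its
# decomposition, so repeated calls give bitwise-identical results.
assert torch.equal(x.cumsum(0), x.cumsum(0))
```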

Fixes #89492
Fixes #75240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby, https://github.com/eqy
2024-10-10 06:59:08 +00:00
71010bf097 [Inductor][CPP] generalize the wgt tensor delete (#135101)
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.
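A minimal sketch of that selection logic as a hypothetical helper (not the actual Inductor code):

```python
def deletable_weight_nodes(input_nodes, new_input_nodes):
    # Nodes present in input_nodes but absent from new_input_nodes are no
    # longer used by the GEMM template and are candidates for deletion.
    remaining = set(new_input_nodes)
    return [node for node in input_nodes if node not in remaining]
```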

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-10 06:01:09 +00:00
ea83c78174 [SymmetricMemory] set the storage_offset of tensors returned by get_buffer() to 0 (#137569)
It seems that there's a bug in `TensorMaker` - it treats `storage_offset` as bytes when calculating the storage size, but as an element count when setting the tensor's `storage_offset`. This seems to cause tensors returned by `get_buffer()` with a non-zero offset to report the wrong storage size.

Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
2024-10-10 05:05:58 +00:00
96bab021c0 ATen | Fix header namespace pollution. (#137619)
Summary: Fixing a warning, so we can enable it globally.

Test Plan: Sandcastle-only, no runtime changes.

Differential Revision: D64122115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137619
Approved by: https://github.com/Skylion007
2024-10-10 05:04:54 +00:00
1aa130e80c Avoid generating as_strided for alaising views in auto_functionalize_v2 (#137149)
During auto_functionalize_v2, if we encounter a view whose size(), stride(), and storage_offset() match the base, we regenerate the view by calling aten.alias instead of as_strided, for better performance.
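A minimal sketch of the metadata check described above (an illustrative helper, not the actual functionalization code):

```python
import torch

def regenerate_view(base: torch.Tensor, view: torch.Tensor) -> torch.Tensor:
    # If the view has the same size/stride/storage_offset as its base, an
    # alias is enough and the more expensive as_strided call can be avoided.
    if (view.size() == base.size()
            and view.stride() == base.stride()
            and view.storage_offset() == base.storage_offset()):
        return torch.ops.aten.alias(base)
    return torch.ops.aten.as_strided(
        base, view.size(), view.stride(), view.storage_offset())
```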

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
2024-10-10 05:00:41 +00:00
b5284a01a4 [CPU] remove keyword static for exp_u20 (#137571)
Remove the `static` keyword from the vector-register constants in the exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1 ms to 4.7 ms on SPR with 56 threads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
2024-10-10 04:52:04 +00:00
d170c410f2 Clean up op BC check list (#137634)
Summary: Remove some stale items

Test Plan: CI

Differential Revision: D64133246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137634
Approved by: https://github.com/hl475
2024-10-10 04:29:21 +00:00
249152475d fix sequence number for group (#134578)
Summary:
Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay.

`ProcessGroupNCCL` has two sequence number counters, `seqCollective_` and `seqP2P_`.
b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L1188-L1191)
However, `WorkNCCL` only has one sequence number member `seq_`. b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L387)
We need to match collective and p2p with wait separately.
29b5a462dc

Depend on: https://github.com/pytorch/pytorch/pull/135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
2024-10-10 04:24:06 +00:00
5aa9f2b660 Fixed issue with nn.Transformer().generate_square_subsequent_mask() (#137654)
Fixed issue where nn.Transformer().generate_square_subsequent_mask() doesn't respect set_default_device() and set_default_dtype().
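A small sketch of the expected behavior after the fix, assuming a CUDA machine:

```python
import torch
import torch.nn as nn

torch.set_default_device("cuda")
torch.set_default_dtype(torch.float64)

mask = nn.Transformer.generate_square_subsequent_mask(4)
print(mask.device, mask.dtype)  # should now follow the defaults set above
```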

Fixes #137186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137654
Approved by: https://github.com/mikaylagawarecki
2024-10-10 03:10:01 +00:00
b9c9f7f0fa Document ROCm environment variables and improve CMake messaging to user (#137308)
Fixes #115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.

The PR improves the documentation and CMake's behavior for ROCm builds.

- Documentation: There were two environment variables for ROCm builds that are now documented: `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-10 03:08:08 +00:00
f394fb554b Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541)
Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky, with 8% detected variance,
so I set it up with a 20% threshold (roughly 8% * 2).
The others are stable within ±1.5%.

<img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf">

<img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541
Approved by: https://github.com/aorenste
2024-10-10 02:47:38 +00:00
f9ed39c989 Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637
Approved by: https://github.com/albanD
2024-10-10 01:23:30 +00:00
d50d5df2fb Add warning for non static grads in optimizer variable (#137554)
Fixes https://github.com/pytorch/pytorch/issues/112548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137554
Approved by: https://github.com/williamwen42
2024-10-10 01:23:21 +00:00
f301f6544b fix bug for fill_empty_deterministic_ not support complex half (#137488)
Fixes #133157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137488
Approved by: https://github.com/ezyang
2024-10-10 01:21:32 +00:00
361046718d Generate new expected results file when there is failures in diff time benchmarks (#137551)
The test also adds a signpost log for the benchmarks that pass.
To test, I ran `python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv`.
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.

REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.

PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%

MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

You can use the new reference expected result stored at path: out.csv.

a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
2024-10-10 01:09:15 +00:00
d9f4a7d3f9 Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64135230](https://our.internmc.facebook.com/intern/diff/D64135230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-10-10 00:52:50 +00:00
4f45c76806 [PGNCCL] Limit access to ncclComm_ (#137573)
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` helps us do that (thanks to a wait function inside), and is thus preferred over directly using `ncclComm_`.

To prevent `ncclComm_` from being used directly outside, e.g. in `ProcessGroupNCCL`, we also make it a private member of the `NCCLComm` class -- the external-facing wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
2024-10-10 00:34:05 +00:00
cyy
0739efbd1f Remove reference of gcc7 from CI scripts (#137339)
Because gcc7 can't be used to build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137339
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-10-10 00:29:29 +00:00
47a515d260 [c10d] simplify barrier implementation and further decouple CPU/GPU synchronization (#137516)
Summary:
Barrier is essentially intended to block the CPU thread (instead of GPU
streams). Before, we used two stream synchronizations (1. the current stream
blocked by the NCCL stream end event, 2. the CPU thread blocked on the current
stream). This is unnecessary, as we already have CPU-thread blocking logic in
wait(). Also, adding a barrier-specific code block in the general GPU
synchronize() API is intrusive and confusing.

This PR cleans this.

Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2024-10-09 23:55:28 +00:00
51c33c0b72 Increase the runner size of AVX* jobs to 4xlarge (#137633)
The failing test was recently moved back from slow, and it requires more RAM than what is available on a 2xlarge runner. It looks OK to up the instance size to 4xlarge instead. I missed the periodic jobs in https://github.com/pytorch/pytorch/pull/137447

Example periodic failures de4c2a3b4e (test_cpu_repro)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137633
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-10-09 23:43:49 +00:00
4304c68a4c In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137097
Approved by: https://github.com/angelayi
ghstack dependencies: #137091
2024-10-09 23:34:35 +00:00
6908d8d450 Enable python dispatcher for reinplacing pass (#137091)
Arguably this should be put somewhere higher up in the stack?  Not sure.

Xref: https://fb.workplace.com/groups/6829516587176185/permalink/8042762615851570/

There is a repro but I need to fix more bugs before it can be checked in

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137091
Approved by: https://github.com/bdhirsh
2024-10-09 23:34:35 +00:00
31e334ad9e [unwind] replace LONG_LONG_MAX by the portable LLONG_MAX (#125043)
This fixes a compilation error on systems with the musl c library.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125043
Approved by: https://github.com/aaronenyeshi
2024-10-09 23:34:16 +00:00
aafa02506e [CudaDMAConnectivityDetector] improve the detection robustness (#137530)
- Previously, the detection would fail when run before the user calls APIs such as `torch.cuda.set_device()`. This is because the detection logic requires NVML initialization. In this PR, we add explicit NVML initialization (which is idempotent).
- Previously, any NVML issue that occurred in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
2024-10-09 23:30:16 +00:00
fbaf9b62de [SymmetricMemoryOps] use float32 as the accumulator type when accumulating bfloat16 with multimem.ld_reduce (#137529)
This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
2024-10-09 23:30:16 +00:00
39c5122a4f [IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` Python binding (its original purpose is superseded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
2024-10-09 23:30:16 +00:00
e6edfe3928 [SymmetricMemoryOps] create an out-variant for multimem_one_shot_all_reduce (#137474)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
2024-10-09 23:30:16 +00:00
b22749712c type _inductor/optimize_indexing.py (#137599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137599
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-10-09 23:29:47 +00:00
d67b4f9e5f type _inductor/quantized_lowerings.py (#137598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137598
Approved by: https://github.com/Skylion007
2024-10-09 23:29:26 +00:00
9b01d17b8d Use MetaProxy more pervasively (#137588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137588
Approved by: https://github.com/ezyang
ghstack dependencies: #136674
2024-10-09 23:22:03 +00:00
13cf8360d8 [MPS] Fix testing for generator operators (#137601)
Before this change, tests for operators like `eye` or `triu_indices` were essentially a test that the respective CPU operators are stable, as cpu_sample and mps_sample were the same.

Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`.

Discovered that:
 - `torch.randn` and `torch.randint` fall into the same undefined category
 - `torch.logspace` is not implemented for MPS
 -  Allow 1.0  absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
 - `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
 - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
2024-10-09 23:17:11 +00:00
48fe0d56d6 Type _inductor/exc.py (#137595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137595
Approved by: https://github.com/Skylion007
2024-10-09 23:15:06 +00:00
7408742b67 Make ignore_fresh_unbacked_symbols reentrant (#137605)
I have a test but it requires some other feature work that isn't fully baked.  Maybe this will fix an xfail.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137605
Approved by: https://github.com/albanD
2024-10-09 23:08:05 +00:00
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When TunableOp is enabled, it is easy to hit OOM since the application usually needs a large amount of GPU memory, e.g. when running an LLM for inference. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp:

- record untuned GEMMs to file.

- a Python API named `tune_gemm_in_file` is added to read the untuned file and tune the GEMMs in it (see the sketch below)
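A hedged sketch of what the offline flow might look like; the `torch.cuda.tunable` module path, the record/tune toggles, and the file name are assumptions based on the API name mentioned above, not verified against this PR:

```python
import torch

# Step 1 (record): run the workload once with TunableOp enabled but online
# tuning disabled, so GEMM shapes are recorded to an "untuned" file instead
# of being tuned on the fly (which can OOM for large models).
torch.cuda.tunable.enable(True)        # assumed toggle
torch.cuda.tunable.tuning_enable(False)
# ... run inference here ...

# Step 2 (offline tune): later, read the untuned file and tune every GEMM in it.
torch.cuda.tunable.tune_gemm_in_file("tunableop_untuned0.csv")  # filename assumed
```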

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
839d3568b0 [compiled autograd] fix -Wuninitialized (#137539)
https://github.com/pytorch/pytorch/pull/135663#discussion_r1792408353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137539
Approved by: https://github.com/isuruf, https://github.com/Skylion007
2024-10-09 21:16:26 +00:00
38027b9b47 [SymmetricMemory] fix a bug where numel calculation overflows when the tensor size is large (#137567)
Fixes https://github.com/pytorch/pytorch/issues/137145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137567
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 20:45:57 +00:00
a93ea617b5 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.
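A short sketch of what change 1 looks like on the user side; the mesh shape and dimension names are illustrative, and this assumes a job launched with torchrun on 8 GPUs:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

# mesh_dim_names is now required when passing a 2D mesh (HSDP).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
model = torch.nn.Linear(16, 16, device="cuda")
fully_shard(model, mesh=mesh)
```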

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-09 20:35:09 +00:00
47af7cc962 Add compiler bisector (#131936)
This is a utility to aid torch.compile debugging. You provide a function that returns True on success and False on failure, or you do something out of process and run the bisect helper with `good`/`bad`.

The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example of how to hook it up for aot_eager_decomp_partition and the decomposition subsystem is:

```
    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented
```

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.

We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.

Fix for https://github.com/pytorch/pytorch/issues/126546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
2024-10-09 20:34:11 +00:00
cfe970260a Clarify opt-einsum usage, fix #127109 (#137596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137596
Approved by: https://github.com/albanD
2024-10-09 20:31:24 +00:00
c73d2634b9 Revert "Log chromium event for automatic dynamic reasons (#137491)"
This reverts commit 3c1ab9367885fdb0ead5fcc14a22d6934070ca92.

Reverted https://github.com/pytorch/pytorch/pull/137491 on behalf of https://github.com/jovianjaison due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/137491#issuecomment-2403360486))
2024-10-09 20:24:12 +00:00
16a2c2cfd4 Revert "Introduce torch.sym_sum (#136429)"
This reverts commit 90bed32b986ab1356dc376df3985497cedbe8a29.

Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))
2024-10-09 20:08:01 +00:00
572f506f9c [c10d] Improve split_group test (#137572)
Fix 1:
`backend1 = pg._get_backend`, here `pg` should be `ng1`.

Fix 2:
`dist.broadcast` should be called by ranks of subgroup `ng1` only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137572
Approved by: https://github.com/Skylion007
2024-10-09 19:43:57 +00:00
70288c3c2d Remove dependency on numpy for serialization for XLA/open registration devices without numpy (#137444)
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263

Follow ups: Do the same for maia and mtia

## Motivation

With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors  by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.

However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.

## Why is this ok?

The following comment on why numpy is necessary for serialization is legacy

c87c9f0a01/torch/_tensor.py (L303-L312)

We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content

**Instead, what now happens is that CPU storage is constructed with data from the file *and then* moved onto the backend device.**

Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
2024-10-09 19:35:55 +00:00
aa61e251d4 [FSDP2] Added shard_placement_fn arg (#137496)
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, the tensor dim size must be divisible by the FSDP shard world size.

```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```

## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang  has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
2024-10-09 19:13:32 +00:00
36133f39db Tensorify compute on Python scalars (#136674)
Signed-off-by: Bob Ren <bobrenfb.com>

Commandeered from https://github.com/pytorch/pytorch/pull/130228, as I'm helping @ezyang with shipping dynamic float arguments in PT2. This starts with supporting torch.ops.aten.mul. I'll stack support for other operators on top in subsequent PRs to keep this scoped to the mechanics of the FX pass.
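A hedged sketch of the kind of code this targets; the reuse behavior noted in the comment is the intended outcome of tensorifying float scalars, not something verified here:

```python
import torch

@torch.compile(dynamic=True)
def scale(x, alpha: float):
    # A Python float argument used in a mul; with tensorified scalar compute,
    # the float can be treated symbolically instead of baked in as a constant.
    return x * alpha

x = torch.randn(8)
scale(x, 0.5)
scale(x, 0.25)  # ideally reuses the compiled graph rather than recompiling
```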

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136674
Approved by: https://github.com/ezyang
2024-10-09 18:51:41 +00:00
f15edb291a type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-09 18:35:28 +00:00
9a957e2842 [NCCL][Profiler] Add functionality to call dump function of NCCL profiler plugin (#137523)
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.

The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.

Test Plan:
```
        tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
        # Create a list with same number of elements as world size (aka no. of ranks)
        # During allgather this list is going to be populated with tensors from all ranks (aka all gather)
        gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
        # get collective from all ranks
        if i <= 10 or RANK != 0:
            dist.all_gather(gathered_tensors, tensor)
```
My script triggers the flight recorder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283

Differential Revision: D61929401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
2024-10-09 18:19:33 +00:00
394c143e4e [dynamo] Fix error when inlining certain nested closure returned by another function (#137510)
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.

In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as variable of the cell's actual value, rather than a `NewCellVariable`.

Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.

This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.

This fixes #136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but that's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into the `NestedUserFunctionVariable::closure` upon function creation, at which point Dynamo always has the cell value in `symbolic_locals` for lookup.

Fixes #136814.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
2024-10-09 18:13:57 +00:00
018dabff20 [ONNX] Implement patch for jit.isinstance (#137592)
Patch torch.jit.isinstance so that users' models are dynamo-exportable. Replaces https://github.com/pytorch/pytorch/pull/137487.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137592
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-09 18:06:52 +00:00
ceb2fcc5db [FSDP2] Fixed incorrect tensor meta after .to(dtype) (#137593)
This fixes https://github.com/pytorch/pytorch/issues/137522. After a method that changes the module parameters (like `.to(torch.float64)`), we need to update the `DTensorSpec`, whose `TensorMeta` dtype may have changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137593
Approved by: https://github.com/Skylion007
2024-10-09 17:57:11 +00:00
bae8d5853e [TorchRec][PT2 compile] enable dynamo in _get_user_embeddings (#136798)
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
  GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0).  (Size-like symbols: u22)

  ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
  Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

  Potential framework code culprit (scroll up for full backtrace):
    File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
      if M <= 0 or N <= 0:
```
```
    N = prod(inner_dims)  # type: ignore[arg-type]
    M = prod(outer_dims)  # type: ignore[arg-type]
    if M <= 0 or N <= 0:
        return (
            input.new_zeros(input_shape) if output_mask[0] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
        )
```
# changes
* use guard_size_oblivious, since the new_zeros return is a kind of optimization and shouldn't impact the correctness of the follow-up code logic.
* the size `ret[i][j]` could be zero, so the change in V1 isn't valid
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
    from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
    if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```

# past
* found `u22` was introduced at
```
    def _wait_impl(self) -> List[List[int]]:
        # Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
        if isinstance(self._splits_awaitable, dist.Work):
            self._splits_awaitable.wait()

        ret = self._output_tensor.view(self.num_workers, -1).T.tolist()  # <------ u22 introduced here

        if not torch.jit.is_scripting() and is_torchdynamo_compiling():
            for i in range(len(ret)):
                for j in range(len(ret[i])):
                    torch._check_is_size(ret[i][j])   # <----------  my question: why the _check_is_size isn't enough??
                    torch._check(ret[i][j] > 0)   # <------ added by diff V1
```

Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```

# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)

Differential Revision: D63424929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
2024-10-09 17:19:45 +00:00
4d45536e92 Save aot graph code in AOTAutogradCache for logging purposes (#137432)
Save the string graph code from print_readable

Differential Revision: [D63985711](https://our.internmc.facebook.com/intern/diff/D63985711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137432
Approved by: https://github.com/bdhirsh
ghstack dependencies: #137431
2024-10-09 16:59:08 +00:00
b71d0ac3b1 remove unused variable (#137565)
per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137565
Approved by: https://github.com/Skylion007
2024-10-09 16:31:43 +00:00
ae03c0cff3 Add microbenchmark for FxGraphHashDetails.debug_lines (#137506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506
Approved by: https://github.com/jamesjwu
2024-10-09 16:15:05 +00:00
e945b6600d Support 3.8 compile again (#137587)
This is not going to be very reliable since we don't have CI though...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137587
Approved by: https://github.com/Skylion007
2024-10-09 15:54:52 +00:00
1d15dd7891 Fix triton_reshape to properly expand Min keyword in triton codegen (#137357)
Summary: Previously, triton_reshape would generate code with the `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand the `Min` keyword to `<`.

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```

Differential Revision: D63850158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2024-10-09 15:53:45 +00:00
de4c2a3b4e Add AsyncCollectiveTensor isinstance check to test_graph_input_is_async (#137253)
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
2024-10-09 08:06:16 +00:00
ac8954d1ca [pattern match][SDPA] remove contiguous in sdpa replacement (#136930)
Fixes a perf issue that was found internally.
In that case, we see query(size=[1, 16, 384, 64], stride=[393216, 64, 1024, 1]) in the model code. However, before entering SDPA, it becomes query(size=[1, 16, 384, 64], stride=[393216, 24576, 64, 1]). This is caused by the [SDPA pattern match](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L130-L132), which applies contiguous() to the inputs in the replacement. This is not necessary, as the contiguous() call doesn't exist in the pattern; furthermore, it can sometimes cause perf issues. Anyway, we can do the additional contiguous() in the kernel implementation if needed.
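A small illustration of the stride situation described above; the shapes are chosen to mirror the example and this is not the internal pattern code:

```python
import torch
import torch.nn.functional as F

# q/k/v come out of a permute, so they are non-contiguous views; the fused
# SDPA pattern replacement used to force .contiguous() on them, adding copies
# that the original pattern never had.
qkv = torch.randn(1, 384, 3, 16, 64)
q, k, v = (t.permute(0, 2, 1, 3) for t in qkv.unbind(dim=2))
print(q.shape, q.stride(), q.is_contiguous())   # (1, 16, 384, 64), non-contiguous
out = F.scaled_dot_product_attention(q, k, v)   # works fine without contiguous()
```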

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136930
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jgong5
2024-10-09 07:52:38 +00:00
72ad1b8c6c Make Context to be Device-agnostic Step by Step (2/N) (#136526)
- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #136519
2024-10-09 07:34:30 +00:00
a02093e824 fix test_export_constraints_error_not_in_range (#137500)
Test Plan: fixed

Differential Revision: D64052011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137500
Approved by: https://github.com/tugsbayasgalan
2024-10-09 05:48:14 +00:00
abb00efc14 Add torch.squeeze parameter description to declare allowed type (#137485)
Fixes #137422

Add the parameter type definition in the API docs to clarify the allowed value types and keep users from passing `None` as the `dim` value directly.

```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
2024-10-09 05:29:13 +00:00
df114a447e Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to run OOM on a 2xlarge with a `std::bad_alloc` memory error. Maybe this change would also fix that issue (pending CI testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-09 05:13:53 +00:00
2fff990c16 Revert "[AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)"
This reverts commit 932b9945c0bc61a11a7db2f52c974cf283d5a2ed.

Reverted https://github.com/pytorch/pytorch/pull/137314 on behalf of https://github.com/huydhn due to The failure shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/137314#issuecomment-2401311719))
2024-10-09 04:53:30 +00:00
972822dea1 Minorly reorder optim kwargs in docs, fixes #137391 (#137531)
Closes #137391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531
Approved by: https://github.com/albanD
2024-10-09 04:14:45 +00:00
4628fcf41a Fix ir._WaitKernel (#137401)
In ABI-compatible mode, AOTInductor could not compile _WaitKernel due to
an incorrect outputs list.  Add the correct set of outputs, as done in
ir._CollectiveKernel.create_out_of_place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137401
Approved by: https://github.com/desertfire
ghstack dependencies: #136924
2024-10-09 04:02:30 +00:00
0414aeacd9 AOTInductor: silence linker warnings about executable stacks (#136924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136924
Approved by: https://github.com/desertfire
2024-10-09 04:02:30 +00:00
ddc7b6d0b4 Removes confusing note, addresses #38006 (#137535)
Fixes #38006

The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than is helpful, so removing it is better than leaving it in, especially because most uses of closure that I know of _do_ modify the grads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535
Approved by: https://github.com/albanD
2024-10-09 04:00:38 +00:00
d3edf4ebf4 [SymmetricMemoryOps] implement two-shot all-reduce (#137473)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
2024-10-09 03:49:42 +00:00
82e55b624f [SymmetricMemoryOps] implement one_shot_all_reduce (#137472)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
2024-10-09 03:49:42 +00:00
5d83ee3e32 [SymmetricMemoryOps] refine cross-device barriers (#137471)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Refine the cross-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 03:49:42 +00:00
5f1759a025 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-09 02:29:40 +00:00
d5785d4295 [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-09 02:29:40 +00:00
0a304d9048 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-09 02:29:40 +00:00
b3f30c9bc3 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-09 02:29:40 +00:00
27dee935af [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-09 02:29:40 +00:00
38afac2917 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-09 02:29:40 +00:00
108b469f78 [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-09 02:29:40 +00:00
e41dffbedd [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is, so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
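For context, a minimal sketch of the user-facing pattern this enables; the mode below is a no-op used only for illustration:

```python
import torch
from torch.overrides import TorchFunctionMode

class NoopMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

@torch.compile
def f(x):
    # Entering/exiting the mode inside the compiled region is now traced with
    # the default push/pop semantics instead of forcing a graph break.
    with NoopMode():
        return torch.sin(x) + 1

f(torch.randn(4))
```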
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-09 02:29:40 +00:00
0b8048c78a Fix AOTI CPP GEMM Template issue without freezing (#136421)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, there is the Inductor IR of weight
```
ReinterpretView(
  StorageBox(
    ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
  ),
  FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
  origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the weight used was the one from before the permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-10-09 02:19:07 +00:00
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init device-agnostic and move it to AcceleratorHooksInterface
- refactor context code related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
384ddab294 [c10d] fix sequence numbers for coalesced operations (#135132)
Summary:
We were erroneously incrementing seq_collective for p2p operations.
Fixes issue #134833

Test Plan:
Unit tests.
TODO: add more unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135132
Approved by: https://github.com/fduwjj
2024-10-09 01:38:12 +00:00
8cbb58cff6 [inductor] Limit cpu copies in autotuning to CUDA devices (#137509)
Summary: Missed in https://github.com/pytorch/pytorch/pull/136701#discussion_r1792328849: we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137509
Approved by: https://github.com/int3, https://github.com/eellison
2024-10-09 01:31:58 +00:00
932b9945c0 [AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)
Summary: make it so that `config.activation_memory_budget_solver` can be passed as a callable; that callable is then invoked to determine the set of saved/recomputed nodes.

Test Plan: tbd

Reviewed By: Chillee, basilwong

Differential Revision: D63714905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314
Approved by: https://github.com/eellison, https://github.com/basilwong

Co-authored-by: Parikshit Shah <parikshit@meta.com>
2024-10-09 00:39:29 +00:00
23c531b3e9 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations which may not be easily captured by any module -- see the forward function below. Put differently, the purpose is to exploit the expressiveness of DTensor: we need to first create DTensors before calling ops on them, and having parallelized modules in the model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within the model hierarchy also avoids the use of *FQNs* that the out-of-model annotation case requires.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-10-09 00:19:03 +00:00
de7f32a205 openreg add pin_memory (#135339)
According to the `Next steps` section in test/cpp_extensions/open_registration_extension/README.md, add pinned memory and a HostAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135339
Approved by: https://github.com/albanD
2024-10-09 00:07:59 +00:00
8893881867 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-10-09 00:05:52 +00:00
eqy
cba3f4f5e3 [CUDA] Clean up asserts in test_cuda.py (#137034)
Switch some `assertTrue` tests to `assertEqual` etc for debuggability in logs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034
Approved by: https://github.com/Skylion007
2024-10-08 23:16:19 +00:00
b16167874d Minor SGD docs clarification fixing #137356, #137352 (#137528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528
Approved by: https://github.com/albanD
2024-10-08 23:05:08 +00:00
2a1829d728 Error message for allow_in_graph decorator and arbitrary function combo (#135972)
Fixes #103615

Quick error message for non-allowed allow_in_graph decorator and arbitrary function combo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135972
Approved by: https://github.com/anijain2305
2024-10-08 22:48:38 +00:00
4aed81c0db Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-08 22:36:46 +00:00
02013da038 Lift restriction on training IR for unflatten (#137470)
Differential Revision: [D64025578](https://our.internmc.facebook.com/intern/diff/D64025578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137470
Approved by: https://github.com/avikchaudhuri
2024-10-08 22:30:24 +00:00
81c8a8ada6 [ONNX] Bump onnxscript in CI (#137497)
To 0.1.0.dev20241008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137497
Approved by: https://github.com/titaiwangms
2024-10-08 21:56:30 +00:00
76ab1ab665 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
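For reference, a minimal repro sketch of the failing pattern (the class name and shapes are illustrative, not from the PR): an `autograd.Function` whose second output is never used downstream, so the engine must materialize zero gradients for it.
```python
import torch

class SinCos(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # two outputs; the second one is never used below, so its grad is None
        return x.sin(), x.cos()

    @staticmethod
    def backward(ctx, g_sin, g_cos):
        # g_cos arrives as engine-materialized zeros; for an NJT output this
        # used to fail because zeros() cannot take a size containing a nested int
        return g_sin + g_cos

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)],
    layout=torch.jagged,
    requires_grad=True,
)
out, _unused = SinCos.apply(nt)
out.values().sum().backward()
```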
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
2024-10-08 21:01:36 +00:00
5e3e1c0151 Revert "[FSDP2] Required mesh_dim_names for HSDP (#137436)"
This reverts commit 5fb30df7d6ecc25cc7c4c17a8a33d14ddaa7c279.

Reverted https://github.com/pytorch/pytorch/pull/137436 on behalf of https://github.com/malfet due to Looks like it broke distributed testing, see https://github.com/pytorch/pytorch/actions/runs/11239761070/job/31249854217 ([comment](https://github.com/pytorch/pytorch/pull/137436#issuecomment-2400794929))
2024-10-08 20:50:49 +00:00
b499083a91 Get rid of quadratic tests to has_same_metadata (#136857)
Fixes https://github.com/pytorch/pytorch/issues/136852

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136857
Approved by: https://github.com/isuruf, https://github.com/bdhirsh
2024-10-08 20:49:23 +00:00
d34b617bb9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)"
This reverts commit 51bc839b94829f176e3c1b7f62e3448d6028c480.

Reverted https://github.com/pytorch/pytorch/pull/137114 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
8c937445ee Revert "[Dynamo] Remove ignored modes workaround (#135502) (#137115)"
This reverts commit b1fd7708bd81d8d52908bf4459ed024471abd803.

Reverted https://github.com/pytorch/pytorch/pull/137115 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
e5f9131327 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)"
This reverts commit f9d69cde88ad972ee8fc24413dd0740f4e21562d.

Reverted https://github.com/pytorch/pytorch/pull/137116 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
2d18c2d5e7 Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)"
This reverts commit 941be418d8ec3290d0e3bae0e16a443be26b3075.

Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
cc75ac084f Add test for https://github.com/pytorch/pytorch/issues/137087 (#137090)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137090
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-08 20:17:03 +00:00
5349ee2934 Revert "Parametrize test_lstm_packed (#137447)"
This reverts commit d5493ed579ba41015ffef981832a3f04f94bb6f8.

Reverted https://github.com/pytorch/pytorch/pull/137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](https://github.com/pytorch/pytorch/pull/137447#issuecomment-2400737602))
2024-10-08 20:15:24 +00:00
3c1ab93678 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run the following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-08 19:53:12 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
b41fc14072 compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
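For illustration, a rough sketch of the shape of such a benchmark (the layer count and sizes here are made up, not taken from the actual benchmark):
```python
import torch

# one input flowing sequentially through many weights, mixing ops the
# partitioner would recompute (sin) with ones it would not (matmul)
weights = [torch.randn(64, 64, requires_grad=True) for _ in range(32)]

def f(x):
    for w in weights:
        x = torch.matmul(x, w).sin()
    return x.sum()

f(torch.randn(8, 64)).backward()
```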

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
2024-10-08 18:44:13 +00:00
48b8f818b2 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
2024-10-08 18:44:13 +00:00
53af729a66 add meta for _segment_reduce_backward (#137442)
reland of https://github.com/pytorch/pytorch/pull/124988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137442
Approved by: https://github.com/albanD
2024-10-08 18:40:06 +00:00
1aac1ffce1 Don't generate implicit value ranges for missing symbols. (#136667)
Instead, call back to a missing handler when needed. This greatly speeds things up when the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.

```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
2024-10-08 18:12:57 +00:00
90bed32b98 Introduce torch.sym_sum (#136429)
Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
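As a hedged illustration of the API (not code from the PR), `torch.sym_sum` takes the whole collection at once instead of folding it with `+`:
```python
import torch

def total_rows(tensors):
    # a single symbolic sum node instead of len(tensors) - 1 chained binary adds
    return torch.sym_sum([t.size(0) for t in tensors])

total_rows([torch.randn(2, 3), torch.randn(5, 3), torch.randn(7, 3)])  # 14
```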

update_hint_regression benchmark, before and after:

```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
3bf6594d13 Log compile ids to pt2_remote_cache and pt2_compile_events (#137431)
Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile.

Differential Revision: [D63900826](https://our.internmc.facebook.com/intern/diff/D63900826/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137431
Approved by: https://github.com/oulgen
2024-10-08 18:04:48 +00:00
758dbac308 Add type check for ord in torch.linalg.vector_norm() and torch.linalg.matrix_norm() (#137463)
fixes #137424, fixes #137460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463
Approved by: https://github.com/lezcano
2024-10-08 17:53:56 +00:00
d87835ac32 [Profiler] Clear Out Dangling AppendOnlyLists (#137450)
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when the profiler is exiting.

Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully

Differential Revision: D64010911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
2024-10-08 17:48:59 +00:00
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d826074546558f6665a4c118335a7725503cac.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
a8047564ff Revert "[FlexAttention] Support training bias for eager (#136910)"
This reverts commit 711dacf9845cbc9ea8b3b0fa257309930106712f.

Reverted https://github.com/pytorch/pytorch/pull/136910 on behalf of https://github.com/malfet due to torch.library.custom_op looks weird here and it breaks some internal workloads ([comment](https://github.com/pytorch/pytorch/pull/136910#issuecomment-2400434833))
2024-10-08 17:29:02 +00:00
0b5ade8a12 Revert "[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)"
This reverts commit 68151fd2889c9752348c2dfdc7c175ee201c0cd3.

Reverted https://github.com/pytorch/pytorch/pull/137120 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137120#issuecomment-2400429265))
2024-10-08 17:26:19 +00:00
2570d77a26 Revert "type _dynamo/trace_wrapped_higher_order_op.py (#137354)"
This reverts commit a9f7b905de2217eedee6723b0eb83b3ac7406c26.

Reverted https://github.com/pytorch/pytorch/pull/137354 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137354#issuecomment-2400424669))
2024-10-08 17:22:40 +00:00
76c5bdd2cc Revert "[Dynamo] Handle extracted unbound tensor methods (#137227)"
This reverts commit 14eabd69152e31d059444310979625542db2aece.

Reverted https://github.com/pytorch/pytorch/pull/137227 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137227#issuecomment-2400406384))
2024-10-08 17:12:41 +00:00
c88c0e6c65 Revert "[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)"
This reverts commit d255b34c0ac6208633ed5e71d019fa9ae061e1fc.

Reverted https://github.com/pytorch/pytorch/pull/137119 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137119#issuecomment-2400401262))
2024-10-08 17:09:26 +00:00
cc10ef4645 Revert "[Dynamo] add flex attention mode test (#137121)"
This reverts commit 144665d772f7ec014a4a23f460a632a4a4774f4a.

Reverted https://github.com/pytorch/pytorch/pull/137121 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137121#issuecomment-2400389882))
2024-10-08 17:03:34 +00:00
11192ceca4 Revert "[FlexAttention] only calculate grads for buffers that require_grad (#137451)"
This reverts commit 9f9d252971ea1de04d349a0460e39e3bfe824eae.

Reverted https://github.com/pytorch/pytorch/pull/137451 on behalf of https://github.com/malfet due to Need to revert it in order to be able to backout https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137451#issuecomment-2400385858))
2024-10-08 17:00:59 +00:00
8184e202d8 Update mutation checking in pattern matcher (#137448)
Fix for https://github.com/pytorch/pytorch/issues/137229

The current mutation checking is complicated because it works on pre-grad IR. Once pre-grad IR has been traced to OpOverloads, checking is much easier. I am also special-casing the auto-functional HOP, although I discussed with @zou3519 that it would be nice to have a way of querying HOPs that mimic schemas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137448
Approved by: https://github.com/zou3519
2024-10-08 16:56:40 +00:00
28493efe6e fix silly mapping issue with torch.Size (#137465)
Test Plan: added test

Differential Revision: D64022949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137465
Approved by: https://github.com/yushangdi, https://github.com/angelayi
2024-10-08 16:53:15 +00:00
7267363844 [ONNX] Insert contiguous node between transpose and view before calling run_decompositions (#137340)
Works around #136543.

This fix solves the issue only in the context of the ONNX exporter, but the issue happens in other contexts as well.

The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-08 16:45:59 +00:00
5fb30df7d6 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.
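For reference, a minimal sketch of the now-required usage (assumes an already-initialized process group over 8 ranks; the dim names are illustrative):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

# 2-way replication x 4-way sharding; mesh_dim_names is now required for HSDP
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
model = torch.nn.Linear(16, 16, device="cuda")
fully_shard(model, mesh=mesh)
```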

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-08 16:31:18 +00:00
0bfedb13e7 Remove aoti_torch_zero_ codegen (#137371)
Summary: aoti_torch_zero_ codegen breaks AOTI FC, see discussion in D63281798.

Test Plan: CI

Differential Revision: D63916320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137371
Approved by: https://github.com/jingsh
2024-10-08 15:57:41 +00:00
c04b35a5ae [AOTI] Add standalone version of TORCH_CHECK (#136873)
Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained.

Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873
Approved by: https://github.com/albanD, https://github.com/chenyang78
2024-10-08 15:30:01 +00:00
d5493ed579 Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to OOM on a 2xlarge with a `std::bad_alloc` memory error. Maybe this would also fix that issue (pending CI testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-08 15:26:27 +00:00
3e2f276a14 Fix to() on non-contiguous NJTs (#137124)
Called out via torchrec integration: `lengths` is not handled properly.

Future work (not related to non-contiguous NJTs): #137275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030, #137031
2024-10-08 15:11:05 +00:00
a77bb8527c Make index check in applySelect support deferred runtime assert (#137046)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137046
Approved by: https://github.com/albanD
2024-10-08 14:31:47 +00:00
9b2e453e24 Migrate ARM64 Linux binary jobs to runner determinator (#136666)
Updates ARM64 Linux binary jobs to use the runner determinator.

Issue: pytorch/ci-infra#265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136666
Approved by: https://github.com/ZainRizvi
2024-10-08 12:14:06 +00:00
76dca1fef3 [c10d] separate the codes for GPU stream synchronization and CPU thread synchronization (#137295)
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
2024-10-08 08:53:47 +00:00
9f9d252971 [FlexAttention] only calculate grads for buffers that require_grad (#137451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137451
Approved by: https://github.com/Chillee
2024-10-08 07:36:38 +00:00
59cdd8ddf1 Bump optree version to 0.13.0 to enable Python 3.13 and Python 3.13t support (#137396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137396
Approved by: https://github.com/albanD
2024-10-08 06:49:04 +00:00
493d0eeef3 Revert "Add support for cat memory planning mms with max autotune (#132554)"
This reverts commit d558ec07300defee24dd4a83ab4b387a39ea2176.

Reverted https://github.com/pytorch/pytorch/pull/132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/132554#issuecomment-2398946854))
2024-10-08 06:21:06 +00:00
8ca15e87f5 Update torchbind expecttest from landrace (#137453)
Update expecttest from torch function mode PR landrace (torch function mode changes output code slightly)

Attempted to revert the stack but there were conflicts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137453
Approved by: https://github.com/huydhn
2024-10-08 06:01:29 +00:00
bb31e3f57e Add original forward names to schema so that prettify pass works (#136887)
When we run_decomp, we retrace if it is a training IR. As a result, we need to reliably store the original forward names when we run decomp.

Differential Revision: [D63064453](https://our.internmc.facebook.com/intern/diff/D63064453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136887
Approved by: https://github.com/angelayi
2024-10-08 04:21:02 +00:00
46525abb71 OpenReg: support multiple executors (#136249)
In PR https://github.com/pytorch/pytorch/pull/135646 we split the daemon into a driver and an executor; however, the current executor stands in for all devices and allocates memory all together. In order to better simulate device behavior, here we support multiple executors, with each executor standing in for one device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136249
Approved by: https://github.com/FFFrog, https://github.com/albanD
2024-10-08 01:37:08 +00:00
395e098209 type _dynamo/mutation_guard.py (#137350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137350
Approved by: https://github.com/Skylion007
2024-10-08 00:04:34 +00:00
52ba40c6f6 [ROCm][AOTI] add CK backend (#135641)
Companion to #134379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135641
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78

Co-authored-by: Colin Peppler <colinpeppler@meta.com>
2024-10-07 23:53:58 +00:00
2c0b11c79b forward-fix D63916220 breaking test_cutlass_backend in FBCode (#137435)
Summary: It seems like the import path differs between fbcode and OSS. Wondering how to consolidate them.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend

Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```

Differential Revision: D63991961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
2024-10-07 23:44:04 +00:00
812f286d4a Delete duplicate bindings in torch/csrc/autograd/python_torch_functions_manual.cpp (#136711)
This change deletes the duplicate binding of `torch._functionalize_mark_mutation_hidden_from_autograd()`; the other definition is here:

5c78c6b05a/torch/csrc/autograd/python_torch_functions_manual.cpp (L630-L636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136711
Approved by: https://github.com/soulitzer
2024-10-07 23:19:06 +00:00
d558ec0730 Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-07 22:49:29 +00:00
01bf350967 Fix bmm_sparse_cuda illegal memory access (#131977)
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch

ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```

## Details

In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1                <- start_idx, mid_idx
1 : 1                <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
2024-10-07 22:47:34 +00:00
a6707a7303 [dynamo] log all graph breaks to graph_breaks logging artifact (#137244)
We were previously not logging all graph breaks (e.g. data dependent jumps) to the graph_breaks logging artifact.
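For context, a hedged sketch of how one might enable that artifact programmatically (equivalent to running with `TORCH_LOGS="graph_breaks"`):
```python
import torch
import torch._logging

torch._logging.set_logs(graph_breaks=True)

@torch.compile
def f(x):
    torch._dynamo.graph_break()  # now shows up in the graph_breaks artifact
    return x + 1

f(torch.randn(2))
```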

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137244
Approved by: https://github.com/jansel
2024-10-07 22:34:27 +00:00
a9f7b905de type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-07 21:57:06 +00:00
796c3c3415 Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221)"
This reverts commit 7e13e7dd7e5fc20c0420605aeecb0f902af5326c.

Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))
2024-10-07 21:46:13 +00:00
319eda9dfd [inductor] Add API to make post_grad_custom passes cache-able (#137298)
Summary: See https://github.com/pytorch/pytorch/issues/130772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-10-07 21:11:54 +00:00
8aa110cb00 [ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999)
This pull request enables the int_mm_error tests for ROCm 6.0+, since #122431 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-10-07 21:10:18 +00:00
46abaa3b0f Increase parallelnative shards to 4 (#137433)
The job times out flakily in trunk as its duration is approaching 3.5h https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=parallelnative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137433
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-10-07 21:06:34 +00:00
c87c9f0a01 [inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning (#136701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136701
Approved by: https://github.com/eellison
2024-10-07 19:47:04 +00:00
900f57216f [dynamo] Log a summary of frames Dynamo traced (#137297)
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.

Example:
```python
import torch

@torch.compile
def foo():
    x = torch.ones([10])
    def bar():
        y = x + x
        torch._dynamo.graph_break()
        z = y * x
        return z

    return bar(), bar

foo()
foo()
```

Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```

Fixes #118262.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
2024-10-07 19:44:41 +00:00
f33ffd01f2 [export] fix joint graph metadata (#136011)
Differential Revision: D62652832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136011
Approved by: https://github.com/tugsbayasgalan
2024-10-07 19:36:44 +00:00
08b84afda9 [inductor] Fix alignment hint for WorkspaceArg (#137429)
Alignment hints refer to the base ptr, not the size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137429
Approved by: https://github.com/eellison
2024-10-07 19:32:33 +00:00
fe44b6a67f Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835)"
This reverts commit 40b09edd87fcbe4e63c4db6399ec758d5c34e1b1.

Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/jovianjaison due to this pr is causing typecheck errors internally ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2397661940))
2024-10-07 18:59:41 +00:00
144665d772 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-07 18:55:26 +00:00
d255b34c0a [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-07 18:55:26 +00:00
14eabd6915 [Dynamo] Handle extracted unbound tensor methods (#137227)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-07 18:55:26 +00:00
68151fd288 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fall back due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-07 18:55:26 +00:00
941be418d8 [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-07 18:55:26 +00:00
f9d69cde88 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-07 18:55:26 +00:00
b1fd7708bd [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-07 18:55:26 +00:00
51bc839b94 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode polyfill which no-ops on enter but exits normally. This is needed because we still want to trace the with context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
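To make the scenario concrete, here is a hedged sketch (not a test from the PR) of the kind of program this enables: a default enter/exit `TorchFunctionMode` used as a context manager around code that graph-breaks.
```python
import torch
from torch.overrides import TorchFunctionMode

class NoopMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

@torch.compile(backend="eager")
def f(x):
    with NoopMode():                      # default behavior: push/pop the mode
        y = x + 1
        torch._dynamo.graph_break()       # unsupported code inside the with block
        return y * 2

f(torch.randn(4))
```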
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-07 18:55:26 +00:00
ff95ff5d38 type _dynamo/profiler.py (#137351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137351
Approved by: https://github.com/Skylion007
2024-10-07 18:54:33 +00:00
aa145dead8 [FSDP2] Fixed mistargeted backward prefetch (#137348)
If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly fail to propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
2024-10-07 18:10:09 +00:00
01c07e7864 Revert "[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)"
This reverts commit 8dddd456794f82db5b4e807e9aed1919d3a832da.

Reverted https://github.com/pytorch/pytorch/pull/136920 on behalf of https://github.com/drisspg due to Breaks sdpa with bias support, will switch to newer patch version when released ([comment](https://github.com/pytorch/pytorch/pull/136920#issuecomment-2397548622))
2024-10-07 17:56:57 +00:00
cyy
0c0d8c8ff0 [1/N] Fix extra warnings brought by clang-tidy-17 (#137407)
Before we can use clang-tidy-17
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137407
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-10-07 17:53:59 +00:00
ceb4ed8450 [AOTI][Tooling][10/n] Add scalar and symbolic type input debug printing support (#137323)
Summary:
- Further added more types for debug value dumping.

- Add a test case for symint inputs for the debug printer. In real prod model use cases, "unbacked symints" (those 'u0', 's0', etc.) can be helpful if we can examine their values.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```

Differential Revision: D63864708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
2024-10-07 17:41:40 +00:00
04e48ac562 [inductor] Refactor prefix to make it easy to create subclass of PythonWrapper (#137198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137198
Approved by: https://github.com/jansel
ghstack dependencies: #137191, #137193
2024-10-07 17:20:58 +00:00
e2b72348d0 [inductor] Reuse the subgraph if accessed via same get_attr node (#137193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137193
Approved by: https://github.com/jansel
ghstack dependencies: #137191
2024-10-07 17:20:58 +00:00
7a5eaecd92 [inductor] Correctly keep track of the graph_input_names (#137191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137191
Approved by: https://github.com/jansel
2024-10-07 17:20:53 +00:00
14b4099521 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-07 16:36:31 +00:00
33461592e2 [TLParse] Include cache hit/miss/bypass in the report name (#137282)
Makes it easier to tell at a glance.

https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp1xoGc1/index.html

<img width="398" alt="image" src="https://github.com/user-attachments/assets/7ed111cb-46d8-4442-a1b2-037d0a8decd8">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137282
Approved by: https://github.com/jamesjwu
2024-10-07 16:00:00 +00:00
4db199f15f Implement Remote AOTAutogradCache (#137278)
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.

Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency

See that it cache hits even with local cache removed.

Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj

New unit tests

Reviewed By: oulgen

Differential Revision: D63323958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
2024-10-07 15:38:54 +00:00
f80ed0b831 [export] Custom op meta kernel generation (two pass) (#137277)
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi

Test Plan: followup diff (D63837739)

Differential Revision: D63837740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
2024-10-07 15:34:19 +00:00
e20e7a8c38 Fixed developer setup issue in open_registration_extension (#137355)
This PR fixes an issue where, when running `python setup.py develop`, the `open_registration_extension` self-contained example would not build due to the following:

```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
2024-10-07 15:25:37 +00:00
8c3ab21490 multiprocessing.spawn: allow a grace period when shutdown (#131278)
When one process fails, the others are immediately killed. This prevents other processes from doing necessary cleanups or dumping debug information (in particular, the NCCL flight recorder).

This PR adds a grace period. Default behavior is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
2024-10-07 12:37:34 +00:00
a063a82c8b [redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste.

Test Plan: CI


Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341
Approved by: https://github.com/eqy
2024-10-07 00:58:51 +00:00
d1b87e26e5 [BE] Delete empty files (#137376)
Discovered by running
```
 % find aten -type f -size 0
aten/src/ATen/native/quantized/cpu/qnnpack/wrappers/dummy.c
aten/src/ATen/native/vulkan/api/StringUtil.cpp
aten/src/ATen/native/LegacyBridge.cpp
aten/src/ATen/function_wrapper.py
aten/src/ATen/cudnn/Exceptions.h
```

Most of them were added by b774ce54f8

Remove reference to LegacyBridge.cpp from `aten_native_source_non_codegen_list`:
f42f63ee86/build_variables.bzl (L1317)

And reference to `native/quantized/cpu/qnnpack/wrappers/dummy.c` from f42f63ee86/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl (L440)
Which seems to be a bug from some ancient Android toolchain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137376
Approved by: https://github.com/kit1980, https://github.com/eqy, https://github.com/seemethere, https://github.com/jianyuh, https://github.com/Skylion007
2024-10-06 18:59:04 +00:00
0eba7e5451 Revert runtime numeric check in inductor due to increased compilation time (#137324)
Summary:
This diff reverts D63438718, which caused a latency regression on multiple models.

Test Plan: NA

Differential Revision: D63872515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137324
Approved by: https://github.com/xuzhao9
2024-10-06 05:23:24 +00:00
1dc1b85714 [export] Move swap to a different file (#137134)
Refactor so that unflattener doesn't become too messy

Differential Revision: [D63719648](https://our.internmc.facebook.com/intern/diff/D63719648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137134
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191, #137102
2024-10-06 04:28:18 +00:00
fa9cd46d12 [export] Update swap's forward function (#137102)
Downstream APS code was failing to run the previously swapped module because of some fx.GraphModule forward function weirdness (P1594789677). So to fix this, I just attached a custom forward function which matches the unflattened module's forward function.

Differential Revision: [D63683422](https://our.internmc.facebook.com/intern/diff/D63683422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137102
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191
2024-10-06 04:25:36 +00:00
52d7704b32 [export] Add optimization passes (#136191)
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).

Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.

Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
2024-10-06 04:22:42 +00:00
ad4e91acfe [fsdp2] based on device, use stream and Event (#136843)
Currently FSDP2 supports only CUDA; for other backends that need to use FSDP2, it won't work since streams and events are CUDA-based. To support other backends, use `_get_device_handle` with the device type to get the device module and use that for streams and events.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
2024-10-06 04:17:47 +00:00
4061910ba2 Have Triton CPU backend respect max_autotune setting (#137276)
We would previously autotune regardless of the setting's value.
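A hedged usage sketch (assuming the experimental Triton CPU backend is installed and selected via the inductor config):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpu_backend = "triton"            # opt into the Triton CPU backend

@torch.compile(options={"max_autotune": True})    # autotuning now honors this flag
def mm(a, b):
    return a @ b

mm(torch.randn(64, 64), torch.randn(64, 64))
```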

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137276
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-10-06 03:03:33 +00:00
711dacf984 [FlexAttention] Support training bias for eager (#136910)
Add training bias eager implementation, take over the original POC from https://github.com/pytorch/pytorch/pull/136076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136910
Approved by: https://github.com/Chillee
2024-10-05 19:34:57 +00:00
d073223663 turn CompilationCallbackHandler into dataclass (#137312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137312
Approved by: https://github.com/Skylion007
ghstack dependencies: #137181
2024-10-05 19:03:28 +00:00
f54e142c58 Remove references to Rockset in trymerge (#137207)
For the migration to ClickHouse

But also Rockset is not used in trymerge anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137207
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-10-05 12:53:22 +00:00
40d8260745 [ROCm] remove caffe2 from hipify (#137157)
- Remove all "MasqueradingAsCUDA" files and classes.
- Do not rename "CUDA" classes to "HIP".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157
Approved by: https://github.com/eqy
2024-10-05 12:48:54 +00:00
ca38f28543 [FlexAttention] Adjust BlockMask if reusing the one created at larger seqlen (#137255)
Fixes #136232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137255
Approved by: https://github.com/Chillee
2024-10-05 07:31:32 +00:00
4830bd0dd4 [Doc] Clarify that NaNs are not equal to each other (#137386)
Fixes https://github.com/pytorch/pytorch/issues/137337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137386
Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/kit1980
2024-10-05 06:19:12 +00:00
17718209ea fix specialization bug in unflatten + preserve_module_call_signature (#137363)
Summary: In unflatten, when we generate module calls for modules whose signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.

Test Plan: added a test

Reviewed By: angelayi

Differential Revision: D63913750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
2024-10-05 04:26:02 +00:00
6d0d7b6e37 [CI][BE] Restore cuda memory allocator setting (#137383)
By adding `finally:` clause at the end of the test

Might fix https://github.com/pytorch/pytorch/issues/137098#issuecomment-2389172552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137383
Approved by: https://github.com/ngimel
2024-10-05 04:16:38 +00:00
0067f586ba [audio hash update] update the pinned audio hash (#136968)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136968
Approved by: https://github.com/pytorchbot
2024-10-05 04:08:59 +00:00
4d8b845797 Fix overflow error when torch.bincount() handles a large tensor (#136745)
Fixes #136720

The result in this case is:

```
Traceback (most recent call last):
  File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module>
    result = torch.bincount(input)
             ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745
Approved by: https://github.com/Skylion007
2024-10-05 04:04:48 +00:00
d6f340f66c Determine autograd engine ready queue based on InputMetadata instead of InputBuffer (#135633)
Thanks @awgu for raising this issue and the small repro

From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push based on whichever input buffer's we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
2024-10-04 23:59:46 +00:00
79562f3af8 [ROCm] Modify hipify script to work with Windows paths (#135360)
This change modifies the `hipify_python.py` script to properly detect all directories and `include`/`ignore` paths during the hipification process on Windows, by changing the path syntax convention to a UNIX-like one.

Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate it by converting Windows paths to UNIX-like ones. Doing so limits the number of changes to the file. Moreover, this early unification lets the rest of the code keep its battle-tested Linux-like behaviour.

Another option would be to use the `Path` object from `pathlib` to represent all paths in the script; however, that would impact a broader share of the code and hence require a more meticulous evaluation to ensure the logic and edge cases remain unaltered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2024-10-04 23:43:43 +00:00
8b6774d381 Clarify comment for error handling of dict getattr (#137381)
Just a small nit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137381
Approved by: https://github.com/malfet
2024-10-04 23:40:21 +00:00
f42f63ee86 Add option to disable operator profiling (#136838)
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720

For smaller models, the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly), so we provide users with an option to disable op profiling and essentially only profile the important events, such as inference execution time.

To disable operator profiling users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```

Test Plan: Added test case.

Differential Revision: D61883224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
2024-10-04 22:56:00 +00:00
f2d174c051 Update CODEOWNERS (#136278)
Swap @gokulavasan for @divyanshk as codeowner of torch/utils/data/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136278
Approved by: https://github.com/divyanshk, https://github.com/janeyx99, https://github.com/jansel
2024-10-04 22:42:05 +00:00
88e54de219 More nogil unsafe API fix (#137142)
Covers the PyDict APIs and confirms no update is needed for the PyModule one.
The rest was already covered in https://github.com/pytorch/pytorch/pull/136899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137142
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-10-04 21:56:34 +00:00
e27c0048db Enable additional tests for MPS CI runs (#134356)
As part of the follow-up for https://github.com/pytorch/pytorch/issues/133520, this adapts existing unused tests for use in MPS CI runs, focusing on NHWC and other memory-format tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn
2024-10-04 21:52:38 +00:00
7c1d93944e Proper handling of arguments passed by in kwargs inside zip_schema (#137311)
If the function is `func(a, b, c)` and is called as `func(a=1, b=..., c=...)`, before this change we would not iterate over `a`, `b`, and `c`, since they appear in kwargs. This diff fixes that issue.

This function is used in `_inductor/ir.py` to iterate over custom-op arguments; when a custom pass makes changes and passes arguments as kwargs, we would not process them.
```
        for info, arg in torch._library.utils.zip_schema(schema, args, kwargs):
            handle_aliasing_and_mutation(info, arg)
```
Fix https://github.com/pytorch/pytorch/issues/137057
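A small hedged illustration of the fixed behavior (the op and argument values here are purely for demonstration, not the inductor call site):

```python
import torch
from torch._library.utils import zip_schema

# After this change, arguments supplied via kwargs are also yielded when
# zipping against the op schema.
schema = torch.ops.aten.add.Tensor._schema
x, y = torch.ones(2), torch.ones(2)
for info, val in zip_schema(schema, (), {"self": x, "other": y}):
    print(info.name, tuple(val.shape))
```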

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311
Approved by: https://github.com/zou3519
2024-10-04 21:50:31 +00:00
c0deec120f Fix resurrection logic to trigger early enough (#137267)
Fixes https://github.com/pytorch/pytorch/issues/136358

The bug here is that the Tensor object is actually implemented by two classes: `Tensor` from `_tensor.py` and `TensorBase` from C++.

Before this PR, they have the following gc methods:
Tensor:
 - tp_clear subtype_clear
 - tp_traverse THPVariable_subclass_traverse
 - tp_dealloc THPVariable_subclass_dealloc

TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc

The problem is that when clear is called on the Tensor, subtype_clear clears the things owned by the `Tensor` type (in particular, its `__dict__` attribute) before delegating to the TensorBase clear, where we detect that resurrection needs to happen and skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
2024-10-04 21:13:54 +00:00
bd48933323 Run docker builds on Meta account for now (#137358)
To fix
```
arn:aws:sts::391835788720:assumed-role/ghci-lf-github-action-runners-runner-role/i-096a3e2616140518b is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:us-east-1:308535385114:repository/pytorch/pytorch-linux-jammy-py3-clang18-asan because no resource-based policy allows the ecr:InitiateLayerUpload action
```
Which seems to be doing the trick see https://github.com/pytorch/pytorch/actions/runs/11185419440/job/31098258344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137358
Approved by: https://github.com/huydhn
2024-10-04 20:39:56 +00:00
7b3378a39a [FSDP2] Relaxed even sharding requirement for all-gather extensions (#137005)
This PR relaxes the even sharding requirement for the all-gather extensions.

The `fsdp_pre_all_gather` now expects signature:
```diff
def fsdp_pre_all_gather(
    self,
    mesh: DeviceMesh,
+    outer_size: torch.Size,
+    outer_stride: Tuple[int, ...],
    module: nn.Module,
    mp_policy: MixedPrecisionPolicy,
) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible for returning a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, it may manifest as an NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
2024-10-04 20:34:20 +00:00
f4b415da11 type _dynamo/replay_record.py (#137183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137183
Approved by: https://github.com/Skylion007
2024-10-04 20:29:24 +00:00
6a6a8b17b8 handle state tensors in training ir path (#137240)
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.

Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`

Differential Revision: D63802576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
2024-10-04 20:23:48 +00:00
f0ef7fddde Add ignored/unmaintained comment for capture_autograd_function flag (#137309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137309
Approved by: https://github.com/jansel
ghstack dependencies: #136961
2024-10-04 20:02:37 +00:00
0878739b11 [AOTI] Add C shim for MKLDNN _convolution_pointwise (#137269)
Differential Revision: [D63875271](https://our.internmc.facebook.com/intern/diff/D63875271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137269
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-04 19:42:05 +00:00
a968576777 Add lowering for aten.searchsorted (#135701)
Adds lowering for `aten.searchsorted`. This entails:

1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors.
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.

Closes #135873
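A quick usage sketch of what the new lowering enables (compiling `torch.searchsorted` through Inductor rather than falling back):

```python
import torch

@torch.compile
def bucketize(boundaries, values):
    # boundaries must be sorted along the last dimension
    return torch.searchsorted(boundaries, values)

boundaries = torch.tensor([1.0, 3.0, 5.0, 7.0])
values = torch.tensor([0.5, 4.0, 9.0])
print(bucketize(boundaries, values))  # tensor([0, 2, 4])
```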

Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
2024-10-04 19:26:05 +00:00
58ec6a360c force contiguity for all reduce (#137345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137345
Approved by: https://github.com/xmfan
2024-10-04 19:16:38 +00:00
c0a930b104 Change to export_for_training in quantize_pt2e tests (#137233)
Summary:
as title

also change it in `prepare_pt2e()` docstring

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```

Differential Revision: D63345059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
2024-10-04 18:33:02 +00:00
22e19bd2d7 Add link to torch.compile the missing manual in troubleshooting (#137301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137301
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-10-04 18:19:30 +00:00
79195b9453 [inductor] Add kwargs to bypass unexpected keyword argument error (#137329)
Summary:
I tried `TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/fbcode/profile.txt` and hit:

TypeError: DebugAutotuner.run() got an unexpected keyword argument 'benchmark_run'

Test Plan: ci

Differential Revision: D63876103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137329
Approved by: https://github.com/muchulee8
2024-10-04 18:17:56 +00:00
d2d14d14e3 [RELAND] Fix unlift to preserve aliased constants (#137310)
Differential Revision: [D63864743](https://our.internmc.facebook.com/intern/diff/D63864743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137310
Approved by: https://github.com/avikchaudhuri
2024-10-04 18:15:52 +00:00
8b9cbf22c2 Enable regression test for add loop benchmarks (#136573)
The red dotted line is 1.5

<img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517">

The expected value is taken from the average.
<img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573
Approved by: https://github.com/ezyang
2024-10-04 18:12:08 +00:00
ad240018f2 [PT2][Inductor][Reliability] Add back unit test for pad_mm with BF16 (#137231)
Summary: We add a unit test for the recently added pad_mm pattern in customized Optimus (D63040455), which resolves the long-computation-kernel issue for BF16 on A100.

Test Plan:
```
buck2 test mode/opt //caffe2/test/inductor:pad_mm -- test_pad_mm_bf16
```

Buck UI: https://www.internalfb.com/buck2/4dd4c90c-4a2a-4859-923c-a4008f78a1cd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9851624237127136
Network: Up: 100KiB  Down: 4.3GiB  (reSessionID-87f11454-d920-47af-9af5-39ca0572b7c6)
Jobs completed: 7079. Time elapsed: 3:34.3s.
Cache hits: 99%. Commands: 7061 (cached: 7024, remote: 19, local: 18)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D63794727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137231
Approved by: https://github.com/henrylhtsang
2024-10-04 17:49:55 +00:00
b2979f4382 Allow autocast in training ir export (#137287)
Summary: hardcode "val" field for autocast (similar to set_grad_enabled), to bypass the verifier check.

Test Plan: CI

Differential Revision: D63345767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137287
Approved by: https://github.com/angelayi
2024-10-04 17:38:51 +00:00
42adadf2f2 [aotinductor] enable CUTLASS backend (#134379)
### Context
This PR allows CUTLASS kernels usage in AOTI. It does this by:
* For any CUTLASS kernels that win during autotuning, compile them as a .so & .o
* When creating the final model .so, link all the CUTLASS kernels .o files
* Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel)

### Example
https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549

```
TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134379
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-04 17:32:41 +00:00
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cuDNN, MIOpen, Thrust, and TunableOp. Without this PR, the env var for disabling the caching allocator only partially works.
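For reference, a minimal sketch of how the env var is typically used (set before the process allocates CUDA memory):

```python
import os

# With this PR, disabling the caching allocator also covers raw_alloc users
# (cuDNN, MIOpen, Thrust, TunableOp), not just regular tensor allocations.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # import after setting the env var so the allocator sees it

x = torch.randn(1024, device="cuda")
```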

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
cyy
67908e9111 Enable clang-tidy on torch/csrc/distributed/rpc (#137320)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137320
Approved by: https://github.com/Skylion007
2024-10-04 15:34:05 +00:00
15c3479db7 [AOTI] Fix _scaled_mm ABI-compatible codegen (#137132)
Summary: Similar to https://github.com/pytorch/pytorch/pull/137008, but for supporting _scaled_mm in the ABI-compatible mode.

Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137132
Approved by: https://github.com/chenyang78
ghstack dependencies: #137008
2024-10-04 14:05:18 +00:00
5d24ea81d3 [AOTI] Fix cpp wrapper codegen for _scaled_mm (#137008)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136209. Because _scaled_mm has an out variant, the generated cpp fallback call should call _scaled_mm_out. The ABI-compatible mode needs more work.

Differential Revision: [D63757728](https://our.internmc.facebook.com/intern/diff/D63757728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137008
Approved by: https://github.com/hl475
2024-10-04 14:02:46 +00:00
f56f7476d3 Revert "Add meta functions for lerp, addcmul, and addcdiv. (#136909)"
This reverts commit e4b98b11493914769d15ca8b124c0b5fa1fdd364.

Reverted https://github.com/pytorch/pytorch/pull/136909 on behalf of https://github.com/albanD due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/136909#issuecomment-2393774694))
2024-10-04 14:01:54 +00:00
cd17b2645c Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit a93d3873e97973fbc0009245579ee4e4fa7f155a.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))
2024-10-04 13:58:57 +00:00
5509207543 Revert "[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)"
This reverts commit 592e3a3d4069029946ec4c8d103a313806c53a88.

Reverted https://github.com/pytorch/pytorch/pull/136331 on behalf of https://github.com/albanD due to Breaks aarch64 builds, see link below ([comment](https://github.com/pytorch/pytorch/pull/136331#issuecomment-2393760135))
2024-10-04 13:52:37 +00:00
e80f47fb4d Pass special arguments to user-defined Triton kernels if required (#137236)
Summary:

Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](762a7d197c/python/triton/runtime/autotuner.py (L106)) in the Triton code includes those special configs (names and values) in the potential arguments to the kernel. [Here](762a7d197c/python/triton/runtime/jit.py (L613)) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters.

This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves #136550.
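A hedged sketch of the kind of user-defined kernel this enables (the exact parameter layout is an assumption; the point is that `num_warps`/`num_stages` appear in the kernel's signature and are filled from the winning config):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK_SIZE": 128}, num_warps=4, num_stages=3)],
    key=["n_elements"],
)
@triton.jit
def add_one_kernel(x_ptr, out_ptr, n_elements,
                   BLOCK_SIZE: tl.constexpr,
                   num_warps: tl.constexpr,
                   num_stages: tl.constexpr):
    # num_warps / num_stages are supplied from the chosen config because their
    # names match kernel parameters (per the behavior described above).
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    tl.store(out_ptr + offsets, tl.load(x_ptr + offsets, mask=mask) + 1, mask=mask)
```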

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236
Approved by: https://github.com/zou3519
2024-10-04 07:36:55 +00:00
cyy
6327a71880 [Environment Variable][2/N] Use thread-safe setenv wrapper (#124485)
This follows #119449 to make setenv thread-safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485
Approved by: https://github.com/eqy
2024-10-04 07:30:51 +00:00
6dcd773c57 [export] clean up dynamic markers from tensors (#137230)
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.

I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.

We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.

Test Plan: test_export

Differential Revision: D63773534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
2024-10-04 06:50:45 +00:00
a408cfcbf1 [torch.compile] torch.vmap supports dynamic shapes + enable flex attention create_block_mask dynamic shapes (#137163)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137163
Approved by: https://github.com/Chillee
2024-10-04 05:16:04 +00:00
40b09edd87 Add back DistributedDataParallel types that were lost when pyi was removed (#136835)
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that previously existed became unavailable. This pull request adds them back.

Just for reference, these types are used in pytorch-lightning's LightningCLI. Command-line interfaces are created automatically, and having type hints makes them nicer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
2024-10-04 04:44:20 +00:00
97634e4f82 Rollout infra for executorch migration to training IR (#132703)
Title

Differential Revision: [D60432217](https://our.internmc.facebook.com/intern/diff/D60432217/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132703
Approved by: https://github.com/tarun292
2024-10-04 04:33:08 +00:00
f500cb43bb Fix torch.library.register_vmap (#137306)
We didn't support multiple levels of vmap. The main problem is that, during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched), like our C++ batching rules do.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
2024-10-04 03:46:35 +00:00
cfc51c858a type _dynamo/callback.py (#137181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137181
Approved by: https://github.com/Skylion007
2024-10-04 03:28:52 +00:00
9670e9e5b0 Revert "Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)"
This reverts commit 4f93de895138cc3cb8c4383b480a2d0ecf407e1b.

Reverted https://github.com/pytorch/pytorch/pull/136899 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136899#issuecomment-2392721534))
2024-10-04 03:28:31 +00:00
e4b98b1149 Add meta functions for lerp, addcmul, and addcdiv. (#136909)
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was the root cause of significant
overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
2024-10-04 02:47:25 +00:00
a1f1f585ab clean up error_on_nested_jit_trace flag (#136961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136961
Approved by: https://github.com/jansel
2024-10-04 02:07:54 +00:00
d32696249a [IntraNodeComm] fix a race condition in one-shot all-reduce (#137257)
One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.

Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257
Approved by: https://github.com/Chillee
2024-10-04 01:41:14 +00:00
3d3b394e94 [MTIA](3/n) Implement CPU pins functions for MTIA hooks (#137283)
Summary: Link CPU pins function in MTIA hooks to the host allocator implementation

Test Plan:
signals
unit test in D63709424

Differential Revision: D63352770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137283
Approved by: https://github.com/egienvalue
2024-10-04 01:26:21 +00:00
15e127bc3b [numpy2.0 compat] Fix test_parse_numpy_int_overflow for NumPy 2.0 (#137135)
NumPy now throws an OverflowError when trying to create np.uint64(-1)
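A tiny illustration of the behavior change the test now accounts for:

```python
import numpy as np

# NumPy 1.x silently wrapped this around to 2**64 - 1; NumPy 2.0 raises.
try:
    np.uint64(-1)
except OverflowError as exc:
    print("OverflowError:", exc)
```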

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137135
Approved by: https://github.com/Skylion007
2024-10-04 01:21:12 +00:00
13ec343afe clean up capture_func_transforms flag (#136960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136960
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel
2024-10-04 01:10:52 +00:00
6b9b2a126e Build clang18 image for ASAN tests (#128763)
Use the latest clang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128763
Approved by: https://github.com/malfet
2024-10-04 00:53:56 +00:00
a93d3873e9 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
2024-10-04 00:44:02 +00:00
88e338f4dd [AOTI] Add C shim for MKLDNN _linear_pointwise (#136999)
Differential Revision: [D63851216](https://our.internmc.facebook.com/intern/diff/D63851216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136999
Approved by: https://github.com/leslie-fang-intel, https://github.com/chenyang78, https://github.com/hl475
2024-10-04 00:35:10 +00:00
57c02e5a00 [BE] Use helper functions in mps_extension (#137313)
This should be a no-op change, i.e. it runs the same code, but replaces a verbose Objective-C invocation with a helper function from OperationUtils.h, which this example already depends on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313
Approved by: https://github.com/atalman
2024-10-04 00:26:38 +00:00
bc916a5537 [easy] for test_ck_backend enable RE & activate remaining tests for FBCode (#137305)
Differential Revision: D63859208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137305
Approved by: https://github.com/muchulee8, https://github.com/chenyang78
2024-10-04 00:22:35 +00:00
cyy
60d19cb59e Enable clang-tidy on torch/csrc/distributed/autograd/* (#137180)
Enable clang-tidy on `torch/csrc/distributed/autograd/*` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137180
Approved by: https://github.com/Skylion007
2024-10-03 23:49:23 +00:00
7e13e7dd7e Disallow FakeTensor.data_ptr access in eager mode (#137221)
Previously we raised a deprecation warning (beginning PyTorch 2.4). Now
that we are on 2.6, we're completing the deprecation and disallowing
this behavior.
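A minimal sketch of what is now disallowed (the exact error type and message are assumptions):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    t = torch.empty(4)      # a FakeTensor
    try:
        t.data_ptr()        # previously a deprecation warning; now an error
    except Exception as exc:
        print(type(exc).__name__, exc)
```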

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137221
Approved by: https://github.com/albanD, https://github.com/eellison
2024-10-03 23:47:55 +00:00
cfcd0e1fe9 [ONNX] Update the faketensor documentation (#137292)
Update the faketensor documentation to reflect current usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137292
Approved by: https://github.com/shubhambhokare1, https://github.com/sdpython
2024-10-03 23:27:11 +00:00
4096ed7dc2 Migrate to training ir in quantization_pt2e_qat unittests (#137232)
Summary: Change capture_pre_autograd_graph to export_for_training in unit tests.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Reviewed By: tugsbayasgalan

Differential Revision: D63336660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137232
Approved by: https://github.com/angelayi
2024-10-03 22:57:04 +00:00
b44f25e1ba [CI] Move s390 binary build to its own workflow (#137304)
It was added by https://github.com/pytorch/pytorch/pull/125399 and takes 3 hours to finish.
Considering the limited number of runners, it often causes queueing; see:
<img width="402" alt="image" src="https://github.com/user-attachments/assets/5c67c1d6-af4c-4453-a089-aa1174513bfa">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137304
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/atalman
2024-10-03 22:31:36 +00:00
54094c0c26 [inductor][user triton] Check size hints to determine indexing dtype (#137234)
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.

This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.

Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
2024-10-03 22:07:26 +00:00
c83178d894 Change to export_for_training in XNNPACK tests (#137238)
Summary: as title

Test Plan: CI

Differential Revision: D63344674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137238
Approved by: https://github.com/tugsbayasgalan
2024-10-03 21:28:05 +00:00
ce14f1f0c9 [aoti] Accept constant inputs (#137197)
Fixes https://fb.workplace.com/groups/1028545332188949/posts/1056788036031345/?comment_id=1056790162697799&reply_comment_id=1057501845959964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137197
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire, https://github.com/hl475
2024-10-03 20:59:33 +00:00
eqy
46f158bfbc [cuDNN] Check shapes during graph capture in cuDNN CTCLoss (#130071)
Found out from #125952 about the existence of `_assert_async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130071
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-03 20:10:28 +00:00
592e3a3d40 [PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes https://github.com/pytorch/pytorch/pull/127488 . Includes https://github.com/pytorch/executorch/pull/5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136331
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #136445
2024-10-03 18:18:37 +00:00
c8a7da305b [PyTorch] Add attribute version of C10_ALWAYS_INLINE (#136445)
Sometimes (such as on a lambda), you need `__attribute__((always_inline))` but not `inline`.

Differential Revision: [D63266917](https://our.internmc.facebook.com/intern/diff/D63266917/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136445
Approved by: https://github.com/malfet
2024-10-03 18:18:37 +00:00
525f6715bc Revert "Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)"
This reverts commit f96020c246aec8514b945d530879635a03294f70.

Reverted https://github.com/pytorch/pytorch/pull/137162 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but many jobs are failing with NameError: name _recursive_getattr is not defined + a Lint job fails ([comment](https://github.com/pytorch/pytorch/pull/137162#issuecomment-2392036062))
2024-10-03 18:17:56 +00:00
c7714b8d8d [FR] Fix duplicate output for the case when not all ranks join on collective (#137256)
As the title says: when testing an internal case, we found that we get very similar output for the error when certain ranks do not join one collective. This is because we didn't put all ranks into `candidate_ranks`, so they did not get wiped out from the entries and got checked again.

Ideally, for the given case, we should report this as an out-of-order case, because ranks 0 and 1 call all-to-all while all the remaining ranks call all-gather-base. But when we select entries to compare, we don't have a global view of the entries.

In this specific case, ranks 0 and 1 have a collective of PG 7 on entry 1130 with seq ID = 1130, while the other ranks have a collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use the entry index to do the match because once we later consider P2P, that assumption collapses, so for now we defer it to users or to further stages of the debugging stream to figure out. To make the message clearer, I also include both the seq ID and the record_id (a.k.a. entry index) in the message. (That does not mean this is impossible to implement in code; for example, we could subtract from every record_id the maximum P2P seq ID before it. But users will easily see the wrong order, so we don't think that logic is necessary now.)

P1626755348

Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
2024-10-03 18:06:45 +00:00
adc48a5b52 Python CAPI cleanup (#137266)
This is unrelated to anything else, but as I was going through the code, fixing bad patterns and a refcount bug (which is unlikely to cause any real issue tbh)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137266
Approved by: https://github.com/Skylion007
2024-10-03 17:55:48 +00:00
8bb8c3997b [inductor] parallel compile: add import of thread_safe_fork for internal (#137155)
Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs, and it looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`pytorch/remote_cache:autotune_memcache_version`), but many are read by the Triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork, which installs some functionality to destroy certain singletons before forking and re-enable them afterwards. This approach works for the failing workload.

Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.

Differential Revision: D63736829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison
2024-10-03 17:37:21 +00:00
f96020c246 Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)
When we populate the unlifted graph module, we actually only "unlift" constant tensor inputs, which is problematic because export de-duplicates aliasing constants. As a result, we register only one constant instead of two. This PR fixes that by querying the ep.constants table instead of ep.graph_signature.lifted_tensor_constants.

Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
2024-10-03 17:28:53 +00:00
4d3c0fc061 [AOTAutogradCache] add config for AOTAutograd remote cache (#137011)
Summary: This just adds a config option and JK for turning on the remote AOTAutogradCache. It does not yet implement anything with the new options being passed in; that will come in the next diff.

This PR also changes the command for turning on the local AOTAutogradCache to be more consistent with that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE

Test Plan: Existing tests should pass and should build

Reviewed By: oulgen

Differential Revision: D63321965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
2024-10-03 16:03:47 +00:00
a569a8ac4c type _dynamo/external_utils.py (#137185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137185
Approved by: https://github.com/Skylion007
2024-10-03 15:18:53 +00:00
b6cb174816 Fix serialization for torch.uint16, torch.uint32, torch.uint64 (#137184)
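A minimal round-trip sketch for the newly serializable dtypes:

```python
import io
import torch

t = torch.tensor([1, 2, 3], dtype=torch.uint16)
buf = io.BytesIO()
torch.save(t, buf)
buf.seek(0)
loaded = torch.load(buf)
assert loaded.dtype == torch.uint16
```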
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137184
Approved by: https://github.com/albanD
2024-10-03 14:56:11 +00:00
89b7a5d128 Implement AcceleratorHooksInterface's virtual functions deviceCount() and getCurrentDevice() for CUDA and XPU (#136752)
Fixes #136751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136752
Approved by: https://github.com/albanD
2024-10-03 14:44:58 +00:00
63bbf712d8 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-03 13:13:48 +00:00
38114ec860 [async-tp] fix a race condition that can cause silent correctness issue (#137199)
Details described in https://github.com/pytorch/pytorch/issues/137171:

![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b)

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:
- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137199
Approved by: https://github.com/weifengpy
2024-10-03 10:42:37 +00:00
f166d62764 Avoid __ne__ weakref comparison and use identity instead in cache_size.py (#135000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135000
Approved by: https://github.com/anijain2305
2024-10-03 07:43:58 +00:00
bd9517c1ee cond_batch_rule with boolean pred (#135009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135009
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel, https://github.com/zou3519
2024-10-03 07:43:30 +00:00
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 70019074806920f95976fedad775d7570294f635.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
87bf2a8428 [compiled autograd] initialize cudagraph tls from context manager (#136735)
FIXES https://github.com/pytorch/pytorch/issues/126934. Cudagraphs TLS is initialized on module import, but compiled autograd codepaths might not import it. This causes problems because autograd/compiled autograd will restore TLS state, and in this case will restore the TLS to an uninitialized state

Should fix flaky cudagraph tests: https://github.com/pytorch/pytorch/issues/131663, https://github.com/pytorch/pytorch/issues/132108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136735
Approved by: https://github.com/BoyuanFeng, https://github.com/eellison
ghstack dependencies: #136059
2024-10-03 06:22:11 +00:00
b86269fab5 Unify cpp_extension build directory removal (#136059)
Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-03 06:22:11 +00:00
55c343fa3a [DTensor] Register replication strategy for a few upsampling interpolate ops (#137201)
To unblock Llama 3.2 vision's use case to resize positional embeddings for fine-tuning. Context in [workplace post](https://fb.workplace.com/groups/319878845696681/permalink/1271172040567352/).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137201
Approved by: https://github.com/XilunWu
2024-10-03 03:45:39 +00:00
84cac3585d Move _is_static_problem to mm_common (#137150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137150
Approved by: https://github.com/eellison
2024-10-03 02:55:43 +00:00
5c0ce8d0a6 Skip Flaky Test: for #134602 (#137226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137226
Approved by: https://github.com/cpuhrsch
2024-10-03 01:53:59 +00:00
b3953ff25e [inductor] Reduce block sizes when using Triton CPU backend (#136612)
This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612
Approved by: https://github.com/desertfire
ghstack dependencies: #135342
2024-10-03 01:48:32 +00:00
4513fb5c53 [Inductor] Use parametrize to break down some unit tests (#137156)
Summary: To address the issue that some tests are marked as slow, see https://github.com/pytorch/pytorch/issues/136940#issuecomment-2387227598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137156
Approved by: https://github.com/eellison
2024-10-03 01:43:36 +00:00
7631a04081 [c10d] Fix the device query story of ProcessGroup (#136790)
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.

A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`

The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.
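A usage sketch of the new helpers listed above (the import path and exact signatures are assumptions):

```python
import torch.distributed as dist
from torch.distributed.distributed_c10d import (
    _get_object_coll_device,  # device used for object collectives / barrier
    _device_capability,       # plain list of device types the PG supports
)

# assumes a default process group was initialized via dist.init_process_group(...)
group = dist.group.WORLD
obj_device = _get_object_coll_device(group)  # replaces _get_pg_default_device(group)
supported = _device_capability(group)
print(obj_device, supported)
```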

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
2024-10-03 01:36:22 +00:00
cd5d1fe015 unflatten with specialized graphs per submodule call (#137013)
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.

```
    class N(torch.nn.Module):
        def forward(self, x, b):
            if b:
                return x + 1
            else:
                return x + 2

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.n = N()

        def forward(self, x):
            x0 = x + 3
            x1 = self.n(x0, True)
            x2 = x1 + 4
            x3 = self.n(x2, False)
            return x3 + 5

    m = M()
    inp = (torch.ones(1),)
    print(m(*inp))  # tensor([16.])
    ep = torch.export.export(m, inp)
    print(ep.module()(*inp))  # tensor([16.])

    unflattened = torch.export.unflatten(ep)
    print(unflattened(*inp))  # tensor([15.])
```

However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.

The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.

However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.

A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.

Differential Revision: D63642479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
2024-10-03 00:55:44 +00:00
6241006c28 Fix dependency on filesystem on Linux (#137209)
Similar to: https://github.com/pytorch/pytorch/pull/134494
We are seeing a comeback of https://github.com/pytorch/pytorch/issues/133437 due to the use of filesystem on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137209
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-10-03 00:18:28 +00:00
235f7e06f4 [CI] upload_metrics function to upload to s3 instead of dynamo (#136799)
* Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo
* Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier
* Removes the dynamo_key field since it is no longer needed
* Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore
* Moves s3 resource initialization behind a function so import is faster
---
Tested by emitting a metric during run_test and seeing that documents got added to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799
Approved by: https://github.com/ZainRizvi
2024-10-02 23:19:28 +00:00
2c9e194e23 Revert "[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)"
This reverts commit b50b3b32191e7192a28c54a417891f24df4e4dda.

Reverted https://github.com/pytorch/pytorch/pull/135955 on behalf of https://github.com/PaliC due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/135955#issuecomment-2389810936))
2024-10-02 22:46:31 +00:00
bb03ef7aca [FlexAttention] Fix max-autotune when captured buffers are View nodes (#137204)
## Summary

Originally reported in https://github.com/pytorch-labs/attention-gym/issues/45

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137204
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2024-10-02 22:19:33 +00:00
759cd73adb [Profiler] Update Kineto Submodule (#137137)
Summary: Updating commits from Aug 7, 2024 to Sep 26, 2024

Test Plan: Phabricator + OSS CI

Differential Revision: D63723255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137137
Approved by: https://github.com/aaronenyeshi
2024-10-02 22:19:28 +00:00
e9e5d767b6 [inductor] Fix build_paths usage in config.py (#137187)
Summary: In https://github.com/pytorch/pytorch/pull/136234 we accidentally used the old version of `build_paths`, but in https://github.com/pytorch/pytorch/pull/136952 the API slightly changed. This diff addresses that issue by updating the API usage.

Test Plan: CI

Reviewed By: ColinPeppler

Differential Revision: D63764809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137187
Approved by: https://github.com/ColinPeppler
2024-10-02 22:06:02 +00:00
e95b230fd8 Fix NJT serialization (#137031)
Fixes #129366

Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.

This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring working operation for both contiguous and non-contiguous NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
2024-10-02 21:41:35 +00:00
eqy
be423a8480 [SDPA] Bump grad_query fudge factor for Flash Attention (#135711)
Tolerance issue for small GPUs (e.g., A16, A2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135711
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-10-02 21:35:00 +00:00
36fb342ffd Check for fused kernel before inplace update (#137042)
Summary:
Given an op and a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of the input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion and skips inlining when it holds.

Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.

Fixes #120217

Here's a diagram of the issue:
![Inline+Fusion](https://github.com/user-attachments/assets/c03308d8-fdbf-40a0-a46d-964ece5f9e6d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
2024-10-02 21:14:34 +00:00
a3f3773477 Make PT2E work with both IR simultaneously (#135769)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Differential Revision: D62449830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135769
Approved by: https://github.com/angelayi
2024-10-02 21:05:22 +00:00
4a9225fa1f improve get_schedule_class() (#137103)
Small change to make `get_schedule_class()` accept case-insensitive schedule names.
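A small usage sketch (import path and schedule name shown are assumptions for illustration):

```python
from torch.distributed.pipelining.schedules import get_schedule_class

# Lookup is now case-insensitive, so these resolve to the same schedule class.
assert get_schedule_class("GPipe") is get_schedule_class("gpipe")
```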

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137103
Approved by: https://github.com/kwen2501
2024-10-02 20:08:25 +00:00
2d465e4d1d [non ghstack] Init threadpool with user defined num_threads before default (#137051)
Very similar to https://github.com/pytorch/pytorch/pull/136793, but adds back the `pool->set_thread_count` call, as it is still necessary (I am guessing due to the mutex).

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137051
Approved by: https://github.com/albanD
2024-10-02 20:02:30 +00:00
59d7cf7342 Add _dynamo.config inline_inbuilt_nn_modules and specialize_float logging (#137139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137139
Approved by: https://github.com/ezyang
2024-10-02 19:58:38 +00:00
2b329d3bf1 Fix typo in _normalize ref (#137079)
I think this should basically make no difference numerically, but it does have some ramifications on things like CSE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137079
Approved by: https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049, #137065
2024-10-02 19:06:48 +00:00
6374a19a6e Fix wrapper subclass serialization with custom sizes / strides (#137030)
Fixes #130154

This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.

The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.

Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
2024-10-02 18:55:03 +00:00
8962610247 [BE][clang-format] make macro PyObject_HEAD_INIT(type) and PyVarObject_HEAD_INIT(type, size) have its own line (#136949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949
Approved by: https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #136945
2024-10-02 18:39:22 +00:00
89c37be6b7 [BE][clang-format] make macro PyObject_HEAD have its own line (#136945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136945
Approved by: https://github.com/albanD
2024-10-02 18:39:21 +00:00
54f50f19eb [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-02 18:00:23 +00:00
4559cddaf9 Revert "Add py3.13t linux wheel build (#137127)"
This reverts commit 6b7adc12140d3073c5700cc1c48998556489857e.

Reverted https://github.com/pytorch/pytorch/pull/137127 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but 2 jobs are failing ([comment](https://github.com/pytorch/pytorch/pull/137127#issuecomment-2389250791))
2024-10-02 17:44:42 +00:00
b50b3b3219 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-02 17:26:45 +00:00
c318bafe9c [inductor mkldnn test][BE] Use parametrize to shorten test run time (#137153)
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish. Splitting them into smaller tests, using `parametrize`.

I guess this means this test file has some refactoring opportunities as well. A next step would be to parametrize the add functions.
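
For illustration, a minimal sketch of the `parametrize` pattern (the test class, test name, and parameter values below are hypothetical, not the actual mkldnn tests):

```python
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)

class ExamplePatternMatcherTest(TestCase):
    # Each (epilogue, inplace) combination becomes its own test case, so no
    # single generated test runs long enough to hit the time limit.
    @parametrize("epilogue", ["relu", "gelu"])
    @parametrize("inplace", [True, False])
    def test_unary_fusion(self, epilogue, inplace):
        self.assertIn(epilogue, ("relu", "gelu"))
        self.assertIsInstance(inplace, bool)

instantiate_parametrized_tests(ExamplePatternMatcherTest)

if __name__ == "__main__":
    run_tests()
```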

Differential Revision: D63723925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
2024-10-02 17:20:27 +00:00
466623fb51 [CI] Support for CI GPU test and benchmark on containers (#137169)
Renames the arc references to container, and adds the changes required so that CI jobs requiring a GPU can run on containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137169
Approved by: https://github.com/huydhn
2024-10-02 17:10:59 +00:00
e3fd4d796f [CI] Skip sccache for nvcc builds when building for A100 (#137170)
There is an unknown issue with nvcc builds and sccache; the build crashes with:

```
      /opt/cache/bin/sccache /usr/local/cuda-12.1/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_py_EXPORTS -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/asmjit/src -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/cpuinfo/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.1/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -MD -MT CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o -MF CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o.d -x cu -c /tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/src/sparse_ops/sparse_index_select.cu -o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o
      sccache: error: failed to execute compile
      sccache: caused by: error reading compile response from server
      sccache: caused by: Failed to read response header
      sccache: caused by: failed to fill whole buffer
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137170
Approved by: https://github.com/huydhn
2024-10-02 17:07:24 +00:00
d4cf90d282 [BE] [CI] Skip clean gha workspace if CI is running in a container for checkout-pytorch (#137168)
For the reusable action checkout-pytorch, skip cleaning the workspace when running from a container environment.

The motivation for this change is twofold:
* There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated;
* In the specific case of GITHUB_WORKSPACE, to enable sharing it between multiple containers, it needs to be mounted with `-v`. This prevents the possibility of running `rm -r` and deleting this mount path;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168
Approved by: https://github.com/huydhn
2024-10-02 17:04:50 +00:00
af3e25fea7 remove capture_autograd_function flag (#136959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136959
Approved by: https://github.com/jansel
2024-10-02 16:59:19 +00:00
bcaa0f5ee9 [CI] Remove nanogpt from perf smoke test (#137176)
Summary: nanogpt's performance is not stable. Remove it from the perf smoke test. We may want to use another test instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137176
Approved by: https://github.com/eellison
2024-10-02 16:35:04 +00:00
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
a954a9ea75 [Inductor] External callable registration API for Matmul tuning candidates (#130774)
Fixes #[130769](https://github.com/pytorch/pytorch/issues/130769)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130774
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-10-02 15:38:10 +00:00
af86a6fdba [dynamo][user-defined-class] Fallback when object.__new__ fails (#137044)
Seen in https://github.com/vllm-project/vllm/pull/8949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137044
Approved by: https://github.com/jansel
2024-10-02 14:15:36 +00:00
d29094888b Use torch.Stream & torch.Event for Dynamo capture (#134850)
# Motivation
This PR aims to solve the multiple inheritance problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134850
Approved by: https://github.com/yf225, https://github.com/EikanWang
2024-10-02 14:15:33 +00:00
bf73af4b4e dont let partitioner think it can fuse pointwise ops into user triton kernels (#136878)
Previously if we had a graph like:
```
        triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
        getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
        getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
        sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
        mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```

The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:

(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)

(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)

Reviewed By: embg

Differential Revision: D63551393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
2024-10-02 13:52:44 +00:00
5c2c3ca10b [Inductor] Fix test_conv2d_unary_cpu_cpp_wrapper failure (#137158)
Summary: test_conv2d_unary_cpu_cpp_wrapper is failing on ciflow/slow because of mis-handling of inf. This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137158
Approved by: https://github.com/chenyang78
2024-10-02 13:21:35 +00:00
d117ec1d6e [3/3][Inductor] Make CK work in FBCode (#136234)
Summary:
# Context
Goal: Enable CK for Inductor in FBCode

We split this stack into three diffs to help with review & in case we need to revert anything.

# This Diff
* Gets us to have CK kernels as an option for GEMM autotuning in Inductor.

Reviewed By: zjing14

Differential Revision: D62662705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-02 12:17:38 +00:00
6b7adc1214 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-02 11:59:33 +00:00
8c29a0dd0e [pipelining] Clean up dead code (#136804)
The 'set_requires_grad' dict appears to always be full of "False" values,
and we always set requires_grad based on the value of 'has_backward'.

Setting the requires_grad field was being done repeatedly during
get_fwd_recv_ops, but it should be done just once, so move it to the
function that creates the recv buffers in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804
Approved by: https://github.com/kwen2501
2024-10-02 11:26:31 +00:00
cyy
862029a1ef [Distributed] [15/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#137072)
Follows  #136848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137072
Approved by: https://github.com/kwen2501
2024-10-02 10:56:15 +00:00
ed02309232 type _dynamo/create_parameter_op.py (#136958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136958
Approved by: https://github.com/jansel
2024-10-02 10:23:37 +00:00
52d29a2b94 [reland #136389] Skip kernel saving if already existed (#137073)
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The
following traces show before/after the change for benchmarking a
trivial addmm:

Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">

After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">

We can see that before the change, the benchmarking includes two parts:
   (1) The overhead of our triton_heuristic call, which includes the
   save/get and the (expensive) hash computation.
   (2) The actual computation of the Triton kernel.

   We see that (1) accounts for >50% of the time, which makes kernel
   selection for profiling choose aten kernels over Triton kernels.
Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073
Approved by: https://github.com/desertfire
2024-10-02 09:27:08 +00:00
e374d6850a [distributed][test] Remove unused variable and fix doc typo (#136943)
Refactor distributed test code:
- Fix TODO: Remove unused variable
- Fix doc typo
- Migrate deprecated method call `load_state_dict` and `save_state_dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136943
Approved by: https://github.com/H-Huang
2024-10-02 08:31:53 +00:00
e9a55b43a1 [inductor] Support lists of tensors in operatorbench (#136911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136911
Approved by: https://github.com/eellison
2024-10-02 06:41:06 +00:00
a89e3c2490 Add compiled_autograd_kwargs_override Dynamo config (#136967)
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).

With `torch._dynamo.compiled_autograd=True`, we previously were not able to set a different `fullgraph` config value for the forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.

With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.
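
A hedged usage sketch (assumption: the new config takes a dict of `torch.compile` kwargs that overrides the forward-pass settings when Dynamo traces the compiled autograd backward):

```python
import torch

# Assumed shape of the override: a dict of torch.compile kwargs.
torch._dynamo.config.compiled_autograd = True
torch._dynamo.config.compiled_autograd_kwargs_override = {"fullgraph": True}

@torch.compile(fullgraph=False)  # forward pass may still graph-break
def loss_fn(x):
    return (x * x).sum()

loss_fn(torch.randn(8, requires_grad=True)).backward()
```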

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
2024-10-02 06:23:59 +00:00
b51d22b8bb [BE] [NEON] Use vshlq_n_u32 instead of vshlq_u32 (#137122)
As the compiler optimizes it away anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137122
Approved by: https://github.com/kit1980
2024-10-02 06:18:11 +00:00
2854d157de Add type annotations for higher order ops/flex_attention (#137065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137065
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049
2024-10-02 04:39:25 +00:00
3b8511dadf Remove python 3.8 from triton builds (#137141)
All jobs have switched to Python 3.9, so these 3.8 builds are no longer necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137141
Approved by: https://github.com/albanD
2024-10-02 03:36:54 +00:00
8e39f2a4a5 [Inductor] Enable a SDPA pattern matching for CUDA (#137085)
Summary: Fixes https://github.com/pytorch/pytorch/issues/122429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137085
Approved by: https://github.com/eellison
2024-10-02 03:22:11 +00:00
18525e185e Fix rendezvous error due to EtcdStore get method not waiting in some cases (#137056)
Fixes #132950

This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch.

This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).

We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly.

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056
Approved by: https://github.com/fduwjj

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>
2024-10-02 01:45:00 +00:00
f108f88c40 [logging/debugging] handle None (constant) args in debug log (#137032)
Summary:
# Why

The arguments are filtered out because they are just constants in the compiled graph, but the logger still expects a non-None type.

# What

When passing a filtered-out arg (None) to the debug logger, just log that it's a filtered-out argument instead of throwing a TypeError.

# Background

https://github.com/pytorch/pytorch/pull/131594

Test Plan: - execute repro from https://github.com/pytorch/pytorch/issues/135584#issue-2516944089 with and without the edits

Differential Revision: D63652564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137032
Approved by: https://github.com/angelayi
2024-10-02 01:43:22 +00:00
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00
c7638da558 Lowerings: remove restriction on TensorBox keyword arguments (#136055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136055
Approved by: https://github.com/eellison
2024-10-02 00:40:43 +00:00
63d6908da0 fix build error with gcc 12+ (#137092)
Fixes #127920

This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.

**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
  431 |             __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      |             ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally

By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.

**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
2024-10-02 00:37:15 +00:00
d725758210 [ts_converter] Fix prim::If buffer names (#136648)
Summary:
We previously incorrectly handled the following graph, specifically for the node `w.3` in `block0`:
```
 graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu),
       %y.1 : int):
   %2 : __torch__.___torch_mangle_1.M = prim::CreateObject()
   %3 : int = prim::Constant[value=20](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:34
   %4 : int = prim::Constant[value=10](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:34
   %5 : int = prim::Constant[value=1](), scope: M::
   %w.1 : int = prim::GetAttr[name="w"](%2), scope: M::
   %7 : int = aten::mul(%w.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:25
    = prim::SetAttr[name="w"](%2, %7), scope: M::
   %h.1 : int = prim::GetAttr[name="h"](%2), scope: M::
   %9 : int = aten::mul(%h.1, %3), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:25
    = prim::SetAttr[name="h"](%2, %9), scope: M::
   %10 : bool = aten::gt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:19
   %res.37 : Tensor = prim::If(%10), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:16
     block0():
       %w.3 : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.1 : Tensor = aten::add(%x.1, %w.3, %5), scope: M:: # <string>:5:9
       -> (%res.1)
     block1():
       %h.3 : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.3 : Tensor = aten::add(%x.1, %h.3, %5), scope: M:: # <string>:5:9
       -> (%res.3)
   %16 : bool = aten::lt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:19
   %res : Tensor = prim::If(%16), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:16
     block0():
       %w : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.15 : Tensor = aten::add(%res.37, %w, %5), scope: M:: # <string>:5:9
       -> (%res.15)
     block1():
       %h : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.21 : Tensor = aten::add(%res.37, %h, %5), scope: M:: # <string>:5:9
       -> (%res.21)
   return (%res)
```

Test Plan: CI

Differential Revision: D63399064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136648
Approved by: https://github.com/SherlockNoMad
2024-10-02 00:07:47 +00:00
8765804542 Continue on error for pytorch autolint (#137104)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137104
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-10-01 22:30:36 +00:00
f0fa460c60 [BE] Add script to keep the runner-determinator scripts in sync (#136794)
Whenever we update runner_determinator.py it needs to be copied over into _runner-determinator.yml.

This is a quick utility script to make that process less tedious
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136794
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-10-01 22:26:28 +00:00
4f93de8951 Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)
PyList_GetItem is audited but not the other APIs yet (they will be done in a follow-up PR to keep this one small enough).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136899
Approved by: https://github.com/colesbury, https://github.com/atalman
2024-10-01 22:05:35 +00:00
6baee60e3c upload test stats: remove nan/inf when uploading (#136877)
`json.dumps(float("inf"))` returns `Infinity`, which is technically invalid json

This is fine if you `json.load` it, but ClickHouse cannot handle it

Solution here: cast inf and nan to string (which ClickHouse is able to cast back to float)
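
A minimal sketch of that casting (the helper name is illustrative):

```python
import json
import math

def json_safe(value):
    # json.dumps(float("inf")) emits the non-standard token Infinity, which
    # ClickHouse rejects, so cast non-finite floats to strings instead.
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)
    return value

print(json.dumps([json_safe(v) for v in [1.0, float("inf"), float("nan")]]))
# prints: [1.0, "inf", "nan"]
```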
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877
Approved by: https://github.com/huydhn
2024-10-01 21:47:46 +00:00
0788d016d6 Update incompatible cudagraph ops skip message (#137015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137015
Approved by: https://github.com/BoyuanFeng
2024-10-01 21:23:36 +00:00
34c18887ad [FlexAttention] Remove restriction on QK headdim > V headdim (#135884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135884
Approved by: https://github.com/Chillee
2024-10-01 21:17:54 +00:00
99eb47fb6d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet
2024-10-01 20:43:10 +00:00
86b715c5f6 Revert "Skip kernel saving if already existed. (#136389)"
This reverts commit 2521cd387482a70d30e4ea922fa4fe3b488c9f6d.

Reverted https://github.com/pytorch/pytorch/pull/136389 on behalf of https://github.com/muchulee8 due to Issue #136940  ([comment](https://github.com/pytorch/pytorch/pull/136389#issuecomment-2386950623))
2024-10-01 20:04:43 +00:00
b53ab8b86a Revert "[dtensor][experimental] expose DTensor Context Parallel API (#137038)"
This reverts commit e23e766cc089b568aa4c0ebf0747ff9b504b8915.

Reverted https://github.com/pytorch/pytorch/pull/137038 on behalf of https://github.com/huydhn due to Sorry for reverting your changes but the doc build failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/137038#issuecomment-2386902253))
2024-10-01 19:49:28 +00:00
a00f0d5db8 [PT2][Inductor] Add runtime numeric check for the post grad pass (#136724)
Summary: Similar to D51838043, we further add a post-grad runtime numeric check, since some graph passes are performed at the aten level.

Differential Revision: D63438718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136724
Approved by: https://github.com/Yuzhen11
2024-10-01 18:56:01 +00:00
d61e45283e Properly interpolate sloc here (#137088)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137088
Approved by: https://github.com/Skylion007
2024-10-01 18:33:03 +00:00
c2dee8ea9c enable lazy init for MTIA (#136902)
Summary: As title.

Test Plan: OSS and Internal CIs

Reviewed By: nautsimon, hanzlfs

Differential Revision: D63434511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136902
Approved by: https://github.com/nautsimon
2024-10-01 18:30:56 +00:00
1f3a793790 Fix PyTorch builds on MacOS-13 (#137095)
By including SonomaOps header

Fixes https://github.com/pytorch/pytorch/issues/137094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137095
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-10-01 17:56:35 +00:00
e23e766cc0 [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-01 17:41:28 +00:00
73b07df042 Preserve custom ops via run_decomps (#136882)
This is a re-apply of https://github.com/pytorch/pytorch/pull/136773?fbclid=IwZXh0bgNhZW0CMTEAAR3SmginkvZcILVY7G2XDa_KosnV4DPmq1l6pkjPIM255QgJLKVAR90rGAU_aem_ZWpcVdUsmAGzOGiwbjtBDg.

Note that this doesn't completely remove the _preserve_ops list from export, mainly because we want a small change to address the failing executorch tests. All the complications included in this PR are deleted in the next PR.

Differential Revision: [D63553086](https://our.internmc.facebook.com/intern/diff/D63553086/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136882
Approved by: https://github.com/bdhirsh
2024-10-01 17:38:00 +00:00
b1b6816e05 [testing] reenable kernel_benchmark.py tests (#136876)
Summary:
# Why

We want this to run internally

# What

- fix python path issue on the test
- reenable the test

# Background

(copied from similar issue resolved earlier)

It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable.
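
A minimal sketch of that propagation (the child script name is hypothetical):

```python
import os
import subprocess
import sys

# Pass the parent's effective sys.path to the child via PYTHONPATH so that
# in-process path setup (not visible in the inherited env) survives the spawn.
env = dict(os.environ)
env["PYTHONPATH"] = os.pathsep.join(filter(None, sys.path))
subprocess.run([sys.executable, "child_benchmark.py"], env=env, check=True)
```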

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark

Differential Revision: D63498897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876
Approved by: https://github.com/henrylhtsang
2024-10-01 17:16:21 +00:00
3d0cb81594 [MPS] Enable bfloat16 testing (#136987)
By even further reducing precisions of imprecise FP16 ops, introducing a new BF16_LOW_PRECISION_OPS category, and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the cause of the low-precision results is that MPSGraph, unlike the rest of PyTorch, does not do accumulation over fp32 for reduction operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
2024-10-01 17:10:07 +00:00
cc2a66c55e [export] hook up mark_dynamic to export Dims (#137029)
Adds Dim.DYNAMIC which calls torch._dynamo.mark_dynamic() in the backend. Similar to Dim.AUTO in that it does automatic inference for ranges & relations, but errors out for specializations.
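
A brief usage sketch (the module and shapes are illustrative):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Dim.DYNAMIC infers ranges/relations like Dim.AUTO, but errors out if the
# marked dimension ends up specialized to a constant.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: Dim.DYNAMIC}})
print(ep)
```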
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137029
Approved by: https://github.com/avikchaudhuri
2024-10-01 17:05:09 +00:00
ef6fd3d780 Fix adaptive_max_pool2d fallback (#136367)
Fixes #136332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136367
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-01 16:20:34 +00:00
8f4f7bed5d [MPS] Fix bfloat to complex casts (#137070)
For Metal cast ops to compile, one needs to explicitly cast to/from `bfloat`, unlike for other dtypes.

Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137070
Approved by: https://github.com/Skylion007
2024-10-01 15:47:29 +00:00
696d01aef3 Revert "inductor: use previous guards to know if a size is 1 for broadcasting (#136670)"
This reverts commit dfdda2f6a603ae9245f38a3e8f6365c3cb6d49ac.

Reverted https://github.com/pytorch/pytorch/pull/136670 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
951107e8c2 Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)"
This reverts commit b17cd264d38ca3381391c449bdaf9f03381caf35.

Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
923410193b Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760)"
This reverts commit c010c6099bf304bbb681af534b9f3996c33ce582.

Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
8f5c2b5f17 type _dynamo/test_case.py (#136957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136957
Approved by: https://github.com/Skylion007
2024-10-01 14:36:22 +00:00
d4cc2aaf1e type _dynamo/logging.py (#136956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136956
Approved by: https://github.com/Skylion007
2024-10-01 14:35:54 +00:00
7303716005 Revert "Simplify find_localzeros (#133325)"
This reverts commit 99f90c379ed214ab30882a87bdb3924ed6d6c899.

Reverted https://github.com/pytorch/pytorch/pull/133325 on behalf of https://github.com/ezyang due to https://fb.workplace.com/groups/gpuinference/permalink/2921405651341417/ ([comment](https://github.com/pytorch/pytorch/pull/133325#issuecomment-2385832600))
2024-10-01 13:25:03 +00:00
6bd9d37266 Remove allow-untyped-defs from torch.fx.experimental.symbolic_shapes (#137019)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137019
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935, #136972
2024-10-01 13:22:10 +00:00
cc8f1cddd4 Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935
2024-10-01 13:22:10 +00:00
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
083921852b set FlexAttention devices properly during tracing (#137049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137049
Approved by: https://github.com/zou3519, https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #136826, #137043
2024-10-01 09:08:08 +00:00
34cef1eaa7 Allow automatic dynamic shapes for closures and set current node properly in flexattention subgraph lowering (#137043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137043
Approved by: https://github.com/drisspg
ghstack dependencies: #136826
2024-10-01 09:08:08 +00:00
37dd924c2d Fix test/test_linalg.py for NumPy 2 (#136800)
Related to  #107302.

When built and tested with NumPy 2 the following unit tests failed.

```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```

This PR fixes them. The test is now compatible with both NumPy 1 & 2.

Some more details:

1. `np.linalg.solve` has changed its behavior, so I added an adapter function in the unit test to keep its behavior the same regardless of whether NumPy 1 or NumPy 2 is used.
2. The cause of the failure is that when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2.
3. NumPy 2 does not allow `np.array(obj, copy=False)`; the recommendation is to use `np.asarray(obj)` instead (see the sketch below).
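
A minimal sketch of the change in point 3 (illustrative only):

```python
import numpy as np
import torch

t = torch.arange(4)

# NumPy 2 raises ValueError when copy=False would require a copy (this is
# the "Unable to avoid copy" failure above); np.asarray copies only if
# needed and works on both NumPy 1 and NumPy 2.
a = np.asarray(t)
# a = np.array(t, copy=False)  # the pattern that failed under NumPy 2
```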

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
2024-10-01 07:53:24 +00:00
df5bbc09d1 Make device-specific event inherits from torch.Event (#134845)
# Motivation
This PR intends to make device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event.
The next PR will remove the now-useless base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance.
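
A small sketch of the intended relationship (guarded on CUDA availability):

```python
import torch

# After this change, a device-specific event is an instance of the generic
# torch.Event abstract class.
if torch.cuda.is_available():
    ev = torch.cuda.Event()
    assert isinstance(ev, torch.Event)
```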

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
2024-10-01 06:28:41 +00:00
cyy
47a78daf91 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/eqy
2024-10-01 06:24:30 +00:00
be169f743b [Dynamo] Mark config.dead_code_elimination as deprecated (#136933)
part of #136862

For reviewers, all call sites are here: https://github.com/search?q=repo%3Apytorch%2Fpytorch+dead_code_elimination+language%3APython&type=code&l=Python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136933
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-10-01 03:51:59 +00:00
6e10f7d8c1 [compiled autograd] undo view_to_reshape inductor fx pass in node name matching (#136741)
Inductor mutates the AOT backward graph. A solution could be to copy the graph, but since we don't know whether compiled autograd is applied or not, it would be expensive to always clone it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741
Approved by: https://github.com/jansel
ghstack dependencies: #135663
2024-10-01 03:22:49 +00:00
40157db5a7 [compiled autograd] log placeholder origin in verbose (#135663)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135663
Approved by: https://github.com/jansel
2024-10-01 03:22:49 +00:00
6966811da6 [test] skip not omit big gpu tests for cuda_cpp_wrapper (#137055)
Summary: The problem is that when the GPU is not big enough, we omit the test cases in the test class. We expect the test to be skipped, but due to fbcode CI it can throw an error instead. This causes the test to be flaky.

Test Plan: ci

Differential Revision: D62037908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055
Approved by: https://github.com/masnesral
2024-10-01 03:03:27 +00:00
cyy
17455695d6 [Distributed] [14/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136848)
Follows  #136713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136848
Approved by: https://github.com/H-Huang
2024-10-01 02:01:13 +00:00
951af3d3d8 Format torch.fx.experimental.validator (#136935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934
2024-10-01 01:47:17 +00:00
33c2d3232f Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934
Approved by: https://github.com/Skylion007
2024-10-01 01:47:16 +00:00
d9c400bd9f Added some tests to prevent regressions in partitioning and flexattention (#136826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136826
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-10-01 01:08:44 +00:00
3f457ee1f6 Fix AOT Graph capture not propagating non_blocking copy parameter to … (#136513)
…inductor codegen.

Fixes #136260

**Note**: this is my first code contribution to torch so please let me know if there's anything I need to fix/some other convention I should follow.

Regarding the bug, re-running the issue's reproduction code:
```
import torch

def fn(x):
    return x.to(device="cuda", non_blocking=True)

inp = torch.randn(3, 4)

torch.compile(fn)(inp)
```

We now have the non_blocking being passed on to codegen properly:

```
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4]"):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         to: "f32[3, 4]" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  ===== __compiled_fn_1 =====
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4][4, 1]cpu"):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         to: "f32[3, 4][4, 1]cuda:0" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.404000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:114] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[], is_train=False, traced_tangent_metas=None, num_symints_saved_for_bw=None, grad_enabled_mutation=None, deterministic=None, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None, num_backward_tokens=0),subclass_metadata=None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] TRACED GRAPH
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  ===== Forward graph 0 =====
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]     def forward(self, arg0_1: "f32[3, 4][4, 1]cpu"):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         device_put: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.device_put.default(arg0_1, device(type='cuda', index=0), True);  arg0_1 = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         convert_element_type: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.convert_element_type.default(device_put, torch.float32);  device_put = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         return (convert_element_type,)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1134] [0/0] [__output_code] Output code written to: /tmp/torchinductor_niklasz/ha/chaai264g6ribfw3q2qhl6ayjtaqaavku5wivxtzw4nabgd6htsv.py
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] Output code:
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] # AOT ID: ['0_inference']
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import torch
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import math
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import random
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import os
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import tempfile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from math import inf, nan
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch import device, empty_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] aten = torch.ops.aten
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] _quantized = torch.ops._quantized
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile = AsyncCompile()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile.wait(globals())
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del async_compile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def call(args):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1, = args
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     args.clear()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     assert_size_stride(arg0_1, (3, 4), (4, 1))
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     with torch.cuda._DeviceGuard(0):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         torch.cuda.set_device(0)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0 = empty_strided_cuda((3, 4), (4, 1), torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0.copy_(arg0_1, True)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         del arg0_1
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return (buf0, )
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.utils import print_performance
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1 = rand_strided((3, 4), (4, 1), device='cpu', dtype=torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     fn = lambda: call([arg0_1])
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] if __name__ == "__main__":
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
```
See above line `buf0.copy_(arg0_1, True)`. Specific log setting used: `export TORCH_LOGS="graph_code,aot_graphs,output_code"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136513
Approved by: https://github.com/eellison
2024-10-01 00:32:47 +00:00
19a4d68224 Add missing mappings to support torch.uint16 in quantization and export (#136547)
Test Plan: CI.

Differential Revision: D63142844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136547
Approved by: https://github.com/angelayi
2024-10-01 00:01:01 +00:00
18e707645c Substitute unbacked symints in expressions (#137020)
Differential Revision: [D63647095](https://our.internmc.facebook.com/intern/diff/D63647095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137020
Approved by: https://github.com/ezyang
2024-09-30 23:07:22 +00:00
af64c44b56 Revert "Don't uselessly recompute axiom dict every static eval call (#135429)"
This reverts commit 1d6e0412f5205b1cd709e034526d7f21d6f2d56f.

Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/ezyang due to try again ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2384288879))
2024-09-30 22:29:13 +00:00
c07ebaf430 [triton] Try to use triton.language.extra.libdevice when possible (#136997)
Summary:
X-link: https://github.com/facebookresearch/generative-recommenders/pull/90

In view of https://github.com/triton-lang/triton/pull/3825 we should try to use `triton.language.extra.libdevice` instead of `triton.language.extra.cuda.libdevice`.

Test Plan: CI

Reviewed By: bertmaher, karthik-man

Differential Revision: D63583965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136997
Approved by: https://github.com/bertmaher
2024-09-30 21:52:44 +00:00
b3972ee19a [triton] Unify build_paths.py for NV & AMD, fix typing (#136952)
Summary: Some build improvements.

Test Plan: CI

Differential Revision: D63583959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136952
Approved by: https://github.com/bertmaher
2024-09-30 21:51:45 +00:00
66a269afe8 Revert "Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)"
This reverts commit cf1a7eab250ea37ca8fda0327e8e38693c3c5c1a.

Reverted https://github.com/pytorch/pytorch/pull/136934 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
c94536ae74 Revert "Format torch.fx.experimental.validator (#136935)"
This reverts commit 377e4bc877a3ac4cd6d073aa513a309159ade991.

Reverted https://github.com/pytorch/pytorch/pull/136935 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
8982906502 Revert "Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)"
This reverts commit 3ff2d93d9f72fd26503ef0cf5c5956edad4c52e6.

Reverted https://github.com/pytorch/pytorch/pull/136972 on behalf of https://github.com/ezyang due to need to back out for merge conflict ([comment](https://github.com/pytorch/pytorch/pull/136972#issuecomment-2384182244))
2024-09-30 21:35:08 +00:00
b825848d85 Fix aarch64 debug build with GCC (#136990)
Fixes #136440

**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.

**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.

The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0

This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.

**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-30 21:11:50 +00:00
866a64ce9a [FSDP2] Added check for contiguous parameters (#137000)
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.

We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
2024-09-30 21:10:47 +00:00
66e3186a48 Revert "Init threadpool with user defined num_threads before default (#136793)"
This reverts commit adbcaee950afa6697c04962096344bf0962a542f.

Reverted https://github.com/pytorch/pytorch/pull/136793 on behalf of https://github.com/janeyx99 due to Caused internal Oculus crash, and internal force landed a diff without exporting to GH =.= ([comment](https://github.com/pytorch/pytorch/pull/136793#issuecomment-2384148132))
2024-09-30 21:10:12 +00:00
bc6adb9596 [EZ][BE] Delete ISSUE_TEMPALTE.md (#137040)
As it has been superseded by the [ISSUE_TEMPLATE](https://github.com/pytorch/pytorch/tree/main/.github/ISSUE_TEMPLATE) folder, per https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository#creating-issue-forms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137040
Approved by: https://github.com/ZainRizvi
2024-09-30 21:04:32 +00:00
d46ebcb31b Enable experiments for protected branches (#136785)
This is to allow the protected branches (like `main` and `nightly`) also run on the LF fleet, now that we've migrated over
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136785
Approved by: https://github.com/jeanschmidt
2024-09-30 20:58:28 +00:00
2ef1454189 Revert "Add int1 to int7 dtypes (#136301)"
This reverts commit bfa16a161d5089a9ba008f5e665f29b58dc16526.

Reverted https://github.com/pytorch/pytorch/pull/136301 on behalf of https://github.com/PaliC due to causing internal failures ([comment](https://github.com/pytorch/pytorch/pull/136301#issuecomment-2384119600))
2024-09-30 20:50:49 +00:00
0ccd39a64b Fix prefix store seg fault (#136872)
Fixes https://github.com/pytorch/pytorch/issues/136723

Do not allow `None` to be passed into `PrefixStore`
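As a minimal usage sketch (the address, port, prefix, and keys are made up), `PrefixStore` is meant to wrap a real store, and passing `None` now fails loudly at construction instead of segfaulting later:

```python
import torch.distributed as dist

# wrap a real TCPStore; every key is transparently namespaced by the prefix
base = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
store = dist.PrefixStore("worker0/", base)

store.set("key", "value")
print(store.get("key"))  # b'value'

# dist.PrefixStore("worker0/", None) now raises instead of crashing later
```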

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136872
Approved by: https://github.com/kwen2501
2024-09-30 20:43:08 +00:00
7b96f3c75d Fix six broken tests in test_ops.py (#136653)
## The problem.

[A commit from three weeks ago](82d00acfee) appears to have broken five tests but was not caught by CI.

[A later commit](https://github.com/pytorch/pytorch/commit/e05ea2b1797) which added a decomposition of `transpose_copy` added another broken test, also seemingly not detected, making six total (listed below).

They came to my attention when I updated some pending decomposition pull requests which passed CI, and started getting failures like [this](https://hud.pytorch.org/pr/134319) for a test unrelated to any of these pull requests, `TestCommonCPU.test_out__refs_transpose_copy_cpu_float32`

Running `python test/test_ops.py -k _copy` on `viable/strict` found failures for six `_refs` ops: `copysign`, `expand_copy`, `index_copy`, `t_copy`, `transpose_copy`, `view_copy`

## The solution

The original commit did actually cause breakage by slightly changing user-visible behavior (in a special case involving scalar tensors being copied between different devices).

This pull request fixes that breakage in a reasonable way, but I don't understand why this error didn't appear in CI until I made later changes in the same area.

## To reproduce

To reproduce the six cases in your own client:

```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=5 python test/test_ops.py TestCommonCPU.test_out__refs_view_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=2 python test/test_ops.py TestCommonCPU.test_out__refs_t_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_index_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/test_ops.py TestCommonCPU.test_out__refs_expand_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_copysign_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=4 python test/test_ops.py TestCommonCPU.test_out__refs_transpose_copy_cpu_float32
```

@amjames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136653
Approved by: https://github.com/zou3519
2024-09-30 20:32:55 +00:00
71aac59e93 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-30 20:24:52 +00:00
dfe1d45332 Enable tracing through auot_functionalized_v2 in compiled autograd (#136806)
auto_functionalize_v2 will be the same as auto_functionalize, except that its args may include some additional constants or symints,
and the tensors are in one of the input list args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136806
Approved by: https://github.com/zou3519
2024-09-30 19:16:13 +00:00
1ef5d4cdde Revert "Allow parallelize_module to get device_mesh from ambient context (#134247)"
This reverts commit 80e7478cc84919a48770ad85d6118294776fca73.

Reverted https://github.com/pytorch/pytorch/pull/134247 on behalf of https://github.com/malfet due to Broke lint, which one can clearly see in PR CI https://github.com/pytorch/pytorch/actions/runs/11112138513/job/30873604386  ([comment](https://github.com/pytorch/pytorch/pull/134247#issuecomment-2383952449))
2024-09-30 19:07:01 +00:00
4af03e54b7 [MPS][BE] Use None as alias for all types (#137004)
Tests like `new_*` and `empty_*` fail with the current implementation; see:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137004
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986, #137003
2024-09-30 19:06:13 +00:00
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
80e7478cc8 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-09-30 18:42:06 +00:00
40f80a70fa Fix lint (#137023)
By migrating some of the workflows to Python-3.9 as 3.8 has been deprecated by https://github.com/pytorch/pytorch/pull/132138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137023
Approved by: https://github.com/ZainRizvi, https://github.com/janeyx99, https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-09-30 18:29:02 +00:00
d33638588e [aoti][inplace] Support skipping model buffers (#136770)
Summary: Some AOTI tensor constants may be model buffers that never need to be updated.

Differential Revision: D62777502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136770
Approved by: https://github.com/muchulee8
2024-09-30 18:28:42 +00:00
3ff2d93d9f Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917, #136934, #136935
2024-09-30 18:04:36 +00:00
475a8a4e0c Update ci-sev.md to make merge blocking not the default
2024-09-30 10:53:31 -07:00
76a57568de Update windows maintainers (#136901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136901
Approved by: https://github.com/albanD
2024-09-30 16:12:49 +00:00
ae3d5ed589 [MPS] Enable nan_to_num for bfloat16 (#136986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136986
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985
2024-09-30 16:09:44 +00:00
d8d3aeae59 [MPS] Enable Renorm for bfloat16 (#136985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136985
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984
2024-09-30 16:09:44 +00:00
538fcd7579 [MPS] Enable torch.linalg.cross for bfloat16 (#136984)
By adding explicit instantiation. Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136984
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983
2024-09-30 16:09:40 +00:00
c13c7e11c5 Revert "[Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit 6931c1644afdba53e63ce5671455e4e1b7265dd9.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its test_cpu_repro test is failing in trunk 6931c1644a ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2383563919))
2024-09-30 15:47:04 +00:00
33d3d6e42a [MPS] Enable bucketization for bfloat16 (#136983)
By simply adding explicit instantiation
Tested in https://github.com/pytorch/pytorch/pull/136987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136983
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982
2024-09-30 14:45:57 +00:00
3ed2969889 [MPS] Extend fmin/fmax/copysign and nextafter to blfoat (#136982)
Just adds instantiation of the kernels and sometimes explicit cast.
Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136982
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981
2024-09-30 14:45:57 +00:00
797092b263 [MPS] Fix Gamma for bfloat16 dtypes (#136981)
Before this change, the test failed with unable-to-compile errors, as `bfloat16` requires an explicit cast.
Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136981
Approved by: https://github.com/Skylion007
2024-09-30 14:45:52 +00:00
a15f3f51bc [AOTI] Update sam_fast from timeout to fail_to_run (#136996)
Summary: sam_fast changes from timeout to fail_to_run after https://github.com/pytorch/pytorch/pull/136591, which "regressed" in a good way. Update the expected result file and continue investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136996
Approved by: https://github.com/ezyang
2024-09-30 14:05:49 +00:00
c010c6099b compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
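
As a rough sketch of the shape of graph being benchmarked (the layer count, sizes, and names below are made up for illustration, not the actual benchmark code):

```python
import torch

class PartitionerToy(torch.nn.Module):
    """One activation flowing through many sequentially-used weights,
    mixing cheap recomputable ops (sin) with matmuls."""

    def __init__(self, n_layers: int = 8, dim: int = 64):
        super().__init__()
        self.weights = torch.nn.ParameterList(
            torch.nn.Parameter(torch.randn(dim, dim)) for _ in range(n_layers)
        )

    def forward(self, x):
        for w in self.weights:
            x = torch.sin(x @ w)
        return x.sum()

model = torch.compile(PartitionerToy())
# compiling the backward exercises the min-cut partitioner
model(torch.randn(64, 64, requires_grad=True)).backward()
```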

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
2024-09-30 13:25:02 +00:00
b17cd264d3 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
2024-09-30 13:25:02 +00:00
dfdda2f6a6 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3)))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
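
A toy illustration of that lookup (using plain sympy; the helper name is made up and this is not the actual inductor code):

```python
from sympy import Eq, floor, symbols

s2, s3 = symbols("s2 s3", positive=True, integer=True)

# the symbolic size from the example above, written with explicit floors
size_expr = floor(64 / floor(2048 / (s3 * floor(s2 / s3))))

# guards recorded during FakeTensor propagation (here just the one we care about)
guards = [Eq(size_expr, 1)]

def statically_known_one(expr, guards):
    # the size is known to be 1 if some recorded guard literally says Eq(expr, 1)
    return any(isinstance(g, Eq) and g.lhs == expr and g.rhs == 1 for g in guards)

print(statically_known_one(size_expr, guards))  # True
```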

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-09-30 13:24:57 +00:00
05b15dba7e [1/N] Fix clang-tidy warnings in torch/csrc/api/ (#134545)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134545
Approved by: https://github.com/ezyang
2024-09-30 09:06:30 +00:00
d6d9183456 [Inductor] Switch cpp_wrapper tests to ABI-compatible (#136904)
Summary: Switch test_cpu_cpp_wrapper and test_cuda_cpp_wrapper to test the ABI-compatible mode only. Fixed a missing Py_NewRef issue for python 3.9.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136904
Approved by: https://github.com/Yoggie9477, https://github.com/chenyang78
2024-09-30 05:44:52 +00:00
ad8fae2aa9 [AOTI] Support test_open_device_registration in ABI-compatible (#136906)
Summary: Add a device type C shim interface to support test_open_device_registration in the ABI-compatible mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136906
Approved by: https://github.com/chenyang78
2024-09-30 05:08:16 +00:00
8dddd45679 [BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)
Updates cudnn frontend submodule to v1.7.0 which has some bugfixes and a couple new features.

https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.7.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136920
Approved by: https://github.com/ezyang
2024-09-30 02:50:16 +00:00
80393c90b3 docs: clarify alias usage for x parameter in vector_norm function (#136921)
- Added a note in the documentation specifying that the `input` parameter can be used as an alias for `x`.

Fixes #136560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136921
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-30 02:50:06 +00:00
377e4bc877 Format torch.fx.experimental.validator (#136935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917, #136934
2024-09-30 02:20:40 +00:00
cf1a7eab25 Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917
2024-09-30 02:20:40 +00:00
0a26851601 [Inductor] Handle device property warp_size is None but used on XPU. (#136834)
Fix #136820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136834
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-09-30 02:08:45 +00:00
6931c1644a [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick the ISA based on the ATEN_CPU_CAPABILITY environment variable to control the CPU vec ISA level for Inductor, like eager mode does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-30 00:53:18 +00:00
9dbc6bacff Propagate detailed location information of shape guards to guards/recompiles output (#136917)
To see the payoff, look at test/dynamo/test_logging.py

The general idea is to refactor produce_guards into produce_guards_verbose which also returns verbose code parts, which have our annotations.

The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917
Approved by: https://github.com/anijain2305
2024-09-30 00:43:12 +00:00
e205193e1c Enable failing diffs on regression (#136551)
1. Example of a failing diff: https://github.com/pytorch/pytorch/pull/136740

2. Test this by running:
   `python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv`

Results:
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail, but it is logged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
2024-09-29 22:31:26 +00:00
d33a5e2a57 [ROCm] fastSpecializedAtomicAdd for MI300 (#135770)
MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135770
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-09-29 21:52:09 +00:00
c9653bf2ca [Elasitc][fix] Use the right env variable TORCH_ELASTIC_WORKER_IDENTICAL for unit test (#136916)
As the title says, this is an easy fix for the unit test.

Differential Revision: [D63577774](https://our.internmc.facebook.com/intern/diff/D63577774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136916
Approved by: https://github.com/wz337
ghstack dependencies: #136865
2024-09-29 03:55:10 +00:00
156ca01e51 Enable clang-tidy on torch/csrc/lazy (#136851)
Enable clang-tidy on  torch/csrc/lazy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136851
Approved by: https://github.com/Skylion007
2024-09-28 21:16:40 +00:00
d3c2123ea6 [BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686)
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side. Still, it's better to set the CMake variable properly.
* Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR, before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazel files.
* Adds a bit to compile time and code size, but well worth it considering it speeds up our internal flash attention significantly on H100s at the cost of some minor additional compile time.
* These kernels and settings will be needed for Flash Attention 3 whenever we add that too.

Fixes #133695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
2024-09-28 21:11:15 +00:00
1d6e0412f5 Don't uselessly recompute axiom dict every static eval call (#135429)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429
Approved by: https://github.com/isuruf
2024-09-28 20:59:59 +00:00
6ecb73bafd Limit the option value of TORCH_SHOW_DISPATCH_TRACE (#136510)
It's more convenient for users to enable or disable the dispatch trace by
setting TORCH_SHOW_DISPATCH_TRACE to 1 or 0, especially when debugging in an IDE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136510
Approved by: https://github.com/shink, https://github.com/ezyang
2024-09-28 20:59:05 +00:00
28224329ad [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.
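
A minimal usage sketch, assuming the `flex_attention` API of the time (the mask function and sizes below are made up):

```python
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    # standard causal mask: a query may only attend to earlier key positions
    return q_idx >= kv_idx

# non-default block size, passed in (Q_BLOCK_SIZE, KV_BLOCK_SIZE) order,
# which is the ordering this fix enforces for the resulting BlockMask
mask = create_block_mask(
    causal, B=None, H=None, Q_LEN=1024, KV_LEN=1024,
    device="cpu", BLOCK_SIZE=(128, 128),
)
```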

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-28 19:56:53 +00:00
cf53ab95dc [halide-backend] Fix ops.fma codegen (#136810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136810
Approved by: https://github.com/eellison
ghstack dependencies: #136808, #136809
2024-09-28 19:26:04 +00:00
8da9c4178c [inductor] Benchmark Halide in operatorbench.py (#136809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809
Approved by: https://github.com/eellison
ghstack dependencies: #136808
2024-09-28 19:26:04 +00:00
a54b69279b Bump triton pin to latest 3.1.x release branch (#136874)
Moves pin to latest in release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136874
Approved by: https://github.com/bertmaher, https://github.com/drisspg, https://github.com/kit1980, https://github.com/malfet
2024-09-28 13:47:07 +00:00
b35f70da05 [ez] fixup the export of D62879819 (#136900)
a line from D62879819 (#136190) went missing somehow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136900
Approved by: https://github.com/atalman
2024-09-28 13:46:17 +00:00
c4ae45104f [PyTorch Pinned Allocaor] Move background thread init from constructor to allocate function (#136879)
Differential Revision: D63553138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136879
Approved by: https://github.com/zyan0
2024-09-28 07:24:44 +00:00
375921b755 [inductor] Improve operatorbench.py (#136808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808
Approved by: https://github.com/eellison
2024-09-28 06:22:02 +00:00
96104db132 [easy] fix typo in debug logs for fx graph cache (#136889)
Summary: Accidentally messed up the debug logging here, fixing typo (scuba + tlparse logging is unaffected)

Test Plan: existing tests

Differential Revision: D63555766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136889
Approved by: https://github.com/oulgen
2024-09-28 03:56:09 +00:00
9e4f24f8e5 Fix PT2 Source Code Annotations (#136460)
Summary: In D60803317, we added CompileContext (trace_id) information to Kineto traces using caching when a CompileContext exits. As pointed out by some users, this gives inaccurate IDs because we are not getting the context that is actually being looked up within the eval_frame. For this reason, we decided to revert that change and go with an approach that involves getting the trace_id associated with a given CacheEntry. To do this, we add a trace_id to the GuardedCode so that it can be passed onto a CacheEntry. Then, we change the lookup function to return said trace_id alongside the code so that we can pass both into our eval function. Once we get to a Torch-Compiled Region, we can just append the context information to the name of the annotation, thus bypassing any need for kwargs.

Test Plan: Added more comprehensive unit test. Saw that all the trace_ids appeared within the graph.

Differential Revision: D63138786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136460
Approved by: https://github.com/ezyang
2024-09-28 03:54:43 +00:00
8df97d78c2 [QAT] Make Fused modules torchscriptable (#136285)
Summary:
Same as title.

Inspired by: https://pytorch.org/tutorials/recipes/script_optimized.html#fix-common-errors-when-using-the-script-method

Test Plan: CI

Differential Revision: D62980019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136285
Approved by: https://github.com/jerryzh168
2024-09-28 03:46:19 +00:00
93dcb92bae [DeviceMesh][EZ] Add group description to new group (#136558)
Add group description to new_group in device_mesh to help with debuggability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136558
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-09-28 03:09:41 +00:00
99f90c379e Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-09-28 02:38:31 +00:00
bfa16a161d Add int1 to int7 dtypes (#136301)
Summary:
Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases
for weight quantization (https://www.internalfb.com/diff/D62464487)

Test Plan:
python test/test_quantization.py -k test_uint4_int4_dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136301
Approved by: https://github.com/ezyang
2024-09-28 02:08:33 +00:00
e4571e7025 Add abi flags to cpp_extension cache folder (#136890)
This is to avoid cache confusion between normal vs pydebug vs nogil builds in cpp extensions which can lead to catastrophic ABI issues.
It is rare today for people to run both normal and pydebug builds on the same machine, but we expect quite a few people to run normal and nogil builds on the same machine going forward.

This is tested locally by running each version alternatively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136890
Approved by: https://github.com/colesbury
2024-09-28 00:49:56 +00:00
f42e88fea5 [reland][Elastic] Skip store barrier and store get in host assign (#136865)
As title this is to reland https://github.com/pytorch/pytorch/pull/136579 as it broke some OSS CI

Differential Revision: [D63542918](https://our.internmc.facebook.com/intern/diff/D63542918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136865
Approved by: https://github.com/atalman
2024-09-27 23:40:42 +00:00
ef3142d2a0 [user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.

However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).

This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
2024-09-27 23:02:46 +00:00
9d67c31758 Cast device index to int before logging (#135405)
DeviceIndex (an int8_t) is interpreted by cout as a char, which then shows up as a control character in logs (e.g. ^A).

Explicitly casting to int to have the numbers printed out correctly.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135405
Approved by: https://github.com/wconstab
2024-09-27 23:01:12 +00:00
fe158cfb47 [aoti] Add warning to ask users to switch to new API (#135893)
Instead of the following:
```
so_path = torch._export.aot_compile(...)
torch._export.aot_load(so_path)
```

The recommended path is to:
```
ep = torch.export.export(...)
pt2_path = torch._inductor.aoti_compile_and_package(ep, ...)
torch._inductor.package.load_package(pt2_path)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135893
Approved by: https://github.com/desertfire
2024-09-27 22:38:11 +00:00
adbcaee950 Init threadpool with user defined num_threads before default (#136793)
Fixes #134714 (or attempts to; I don't know how to test it yet)

For posterity, how one can test:
1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary
2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python
3. import torch
4. torch.set_num_threads(1), make sure this does not trigger any additional threads getting created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793
Approved by: https://github.com/albanD
2024-09-27 22:22:37 +00:00
bc21689136 [sparse][semi-structured] Add float8 dtype support to 24 sparsity (#136397)
Summary:

This PR adds `torch.float8_e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cuSPARSELt >= 0.6.2, via the `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397
Approved by: https://github.com/drisspg
2024-09-27 21:37:34 +00:00
a28b40fa74 Improve is_fbcode functionality (#136871)
Summary: Previously is_fbcode just checked whether the checkout was git or not. This is extremely error prone. Lets make it fool-proof.

Test Plan: unit tests

Reviewed By: masnesral

Differential Revision: D63545169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
2024-09-27 21:19:01 +00:00
283bda01aa [MPS] Error checking/bf16 support for torch.normal (#136863)
Before this change, an attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
resulted in a hard error:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a clear type error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
2024-09-27 21:11:59 +00:00
f7ab0e9989 Revert "[Flex Attention] fix block size order (#136657)"
This reverts commit b42f1e3641314c8dc369255b850450acddf3477c.

Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/ZainRizvi due to Sorry, this seems to break ROCm builds. inductor/test_flex_attention.py::TestFlexAttention::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_float16_score_mod1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/11069782242/job/30759299713) [HUD commit link](b42f1e3641) ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2380031525))
2024-09-27 20:47:54 +00:00
6e70ec9aa5 [SymmetricMemory] expose the multicast_ptr (#136840)
This allows writing triton kernels using the `multimem` ptx instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136840
Approved by: https://github.com/Chillee
2024-09-27 20:41:33 +00:00
f21b471978 Revert "Fix numerical instability for norm (#129352)"
This reverts commit 66340e67515cd3592bda6bdd9bfe2ffa22fe7413.

Reverted https://github.com/pytorch/pytorch/pull/129352 on behalf of https://github.com/atalman due to Breaks Internal CI ([comment](https://github.com/pytorch/pytorch/pull/129352#issuecomment-2379989485))
2024-09-27 20:18:47 +00:00
d55eef5c59 [SymmetricMemory] improve multicast initialization/fallback logic (#136577)
Fixes https://github.com/pytorch/pytorch/issues/136494

Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fall back gracefully.

- In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device.
- Before initializing multicast for a block, ensure all ranks/devices have multicast support.
- This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block.
- Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-09-27 20:04:21 +00:00
e512eac410 Companion PR to https://github.com/pytorch/pytorch/pull/134022 (#136818)
Note: [cuSPARSELt 0.6.0](https://docs.nvidia.com/cuda/cusparselt/release_notes.html#cusparselt-v0-6-0)+ supports SM90 (Hopper). Thanks @xwang233 for catching this bug while testing upstream binaries!

Fixes issues like:

```
  A_compressed = torch._cslt_compress(A)
RuntimeError: CUDA error: architecture mismatch when calling `cusparseLtInit(&handle)`
```

@kit1980 Could we get this cherry-picked to 2.5.0 please?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136818
Approved by: https://github.com/eqy, https://github.com/jcaip, https://github.com/malfet
2024-09-27 19:57:15 +00:00
e5a57932f0 [Pytorch][AO] Update choose_qparams_per_token op to output correct shape for scales and zp (#136807)
- Also makes the scales and zero-point dtypes reconcile with the meta impl as well as with other
quantized ops' representation of scales and zero points
- Makes sure quantize_per_token's output_dtype is respected

There are a few places where we need to reconcile on scale and zero-point dtypes,
but that will come later. These fixes are mainly being done to enable the quantized
KV cache through the ET stack.

Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807
Approved by: https://github.com/jerryzh168
2024-09-27 18:46:17 +00:00
6075f566cc [export] simplify automatic dynamic shapes processing (#136591)
Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620.

This reverts back to `assume_static_by_default=True`, instead using dynamo decorators (e.g. `maybe_mark_dynamic, mark_static`) to handle Dim.AUTO & Dim.STATIC. This is easier to maintain, as it doesn't require reasoning about "inverting" the dynamic_shapes specs, and also opens up usage of other decorators (`mark_dynamic, mark_unbacked`).

On the user side this change has no effect, but internally this means dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators following https://github.com/pytorch/pytorch/pull/135536), but transferring this information for _DimHints via decorators, for Dynamo/non-strict to create symbolic_contexts accordingly, e.g. 7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)

One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked.

Differential Revision: D63358628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591
Approved by: https://github.com/avikchaudhuri
2024-09-27 18:28:51 +00:00
a8b5adcdd5 add types to _dynamo/code_context.py (#136665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136665
Approved by: https://github.com/williamwen42
2024-09-27 18:27:42 +00:00
287dc36395 Revert "[user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)"
This reverts commit 9f5b97a0065dfc4a7978a0fdf3fac2df8aef9519.

Reverted https://github.com/pytorch/pytorch/pull/136686 on behalf of https://github.com/davidberard98 due to breaks lint on main. Please rebase to see and fix the error ([comment](https://github.com/pytorch/pytorch/pull/136686#issuecomment-2379830921))
2024-09-27 18:25:49 +00:00
2208ff64ba Fix RMSNorm doc per #136597 (#136727)
Fixes #136597 (remove incorrect sqrt around `RMS(x)`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136727
Approved by: https://github.com/albanD
2024-09-27 18:21:48 +00:00
2157e396a3 [dynamo] attempt run only mode when dynamo cache limit is hit (#136655)
Implement https://github.com/pytorch/pytorch/issues/135458.

Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively.
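
A small sketch of how one might observe this behavior (the limit value and shapes are arbitrary; `cache_size_limit` is the existing dynamo config knob):

```python
import torch

# deliberately tiny limit so the fallback path is easy to hit
torch._dynamo.config.cache_size_limit = 2

@torch.compile
def f(x):
    return (x * 2).sum()

# each new input rank triggers a recompile; once the limit is reached,
# dynamo first tries to reuse an existing cache entry in run-only mode
# and only skips the frame if none of the guards match
for ndim in range(1, 5):
    f(torch.randn(*([3] * ndim)))
```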

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655
Approved by: https://github.com/jansel
2024-09-27 17:15:05 +00:00
36428f91e9 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit 31c0467594c7c41c8e8ff1828bf01fa31fc4454f.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))
2024-09-27 16:54:27 +00:00
17f396b0b4 Delete project.default_flavors_mode buckconfig (#136772)
Summary: Buck1 only buckconfig

Test Plan: CI

Reviewed By: JakobDegen

Differential Revision: D63430482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136772
Approved by: https://github.com/malfet
2024-09-27 16:24:50 +00:00
cbc182d2e0 Remove problematic constructor (#136708)
Since it calls a pure virtual function and it is not used elsewhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136708
Approved by: https://github.com/ezyang
2024-09-27 16:16:58 +00:00
dc8c0aaf4d [AOTAutogradCache] Log time taken_ns (#136529)
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.

This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events.

Test Plan: Run benchmark, see that the times are in the chromium events.

Reviewed By: aorenste

Differential Revision: D62590077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
2024-09-27 16:14:09 +00:00
9f5b97a006 [user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.

However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).

This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
2024-09-27 16:11:02 +00:00
ad51995468 Add a nightly hotpatch utils for python only PR (#136535)
I think this could help many teams, especially the compile/export teams (/cc @ezyang), by letting end users/bug reporters quickly test a WIP PR when reporting a related bug.

This could quickly run in an official nightly Docker container or in a nightly venv/conda env.

Let me know what you think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136535
Approved by: https://github.com/ezyang
2024-09-27 15:58:48 +00:00
9d72f7481b [MPS] Fix AvgPool2d for float16 (#136822)
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
2024-09-27 15:32:18 +00:00
2b6f4e9e24 [BE][MPS] Delete MacOS12 low-precision ops (#136821)
`norm` and `masked.normalize` still have to stay in the list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136821
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755
2024-09-27 15:32:18 +00:00
45a8b5682e [inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136858)
This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858
Approved by: https://github.com/atalman
2024-09-27 15:14:12 +00:00
34d788ffb0 [aotd] Do not force contiguous() for channels_last (#135225)
Original Issue: https://github.com/pytorch/pytorch/issues/134644

We assume trace_tangents have the same memory_format as the inputs, outputs, and intermediates during the first tracing.

=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format

Runtime:
- Coerce tangents to tracing memory format from metadata

Subclasses logic:
 - Previously coercing tangents logic did not handle nested subclasses case, fixing this.

For Subclasses we deduce memory format for subclass_tensor first, then for each element of subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]

If subclass element (__tensor_flatten__[0] tensors) is also subclass => on its place we will have a nested list of the same structure.

The recursive traversal of the subclass tree is expensive, so we do memory format deduction and coercing at the same time, keeping only one traversal. With this approach there is no regression compared with the previous logic, which also does one traversal (see the `coerce_tangent_and_suggest_memory_format` method).

Other small change:
Remove a duplicated, unrelated comment.

Testing

```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```

Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```

No significant changes in processing time.

(We do a single traversal of the subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
2024-09-27 15:01:20 +00:00
de159f0c8d Revert "Deal with size oblivious before going into worker (#135137)"
This reverts commit 285fa03b5e1540a52b354664f609f8576c5b5431.

Reverted https://github.com/pytorch/pytorch/pull/135137 on behalf of https://github.com/ezyang due to this is the one that actually broke main ([comment](https://github.com/pytorch/pytorch/pull/135137#issuecomment-2379438566))
2024-09-27 14:41:27 +00:00
1be3d62237 [ONNX] Remove unused functions (#136609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136609
Approved by: https://github.com/Skylion007
2024-09-27 14:34:05 +00:00
e5228a7771 Revert "Don't uselessly recompute axiom dict every static eval call (#135429)"
This reverts commit 507c69e20f645fdb0fbf43b05be0c5117971464e.

Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/malfet due to It(or it's parent) broke trunk CI, see 507c69e20f ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2379422971))
2024-09-27 14:33:25 +00:00
a55aa71b04 Limit number of cores to 16 when benchmarking Inductor on ARM (#136424)
Sets OMP_NUM_THREADS to 16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136424
Approved by: https://github.com/malfet
2024-09-27 14:22:49 +00:00
e9d2765ec8 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d1bb8e828f280d1c66fff193c043d5bc36154577.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2379214226))
2024-09-27 12:54:47 +00:00
c2637a7b26 [inductor] [cpp] fix gemm_output_name conflict (#136419)
Fixes the max-autotune failure of `soft_actor_critic` of Torchbench in FP32 single thread dynamic shape case:
```log
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
    C_ptr = f"&({kernel.index(C, [0, 0])})"
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
    else self.args.input(node.get_name())
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
    assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```

The 1st and 2nd linears do not need to use a local buffer, while the 3rd linear does.
The 3rd linear, which uses a local buffer, will add its global buffer (named `buf_GemmOut`) to `V.graph.removed_buffers`.

When scheduling the nodes, the 1st linear (which won't use a local buffer) will get its output buffer (also named `buf_GemmOut`) from the input, find that it is in `V.graph.removed_buffers`, and raise an AssertionError. The issue is that the output buffers of all these linears are all named `buf_GemmOut`, which conflict.

Rename these buffers by adding the name of the `template_buffer` as the prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
2024-09-27 12:23:17 +00:00
b42f1e3641 [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-27 11:26:47 +00:00
9581508383 [aotd] Cleanup on subclasses in inductor freezing (#136549)
Cleanup:
1/ We do not need to unwrap_subclasses() in the freezing wrapper, as it will be wrapped by the AOTD wrappers, which include SubclassesWrapper
2/ No need to use weak references for the unwrapped list; dynamo optimizers need to clean the unwrapped list along with the original params_flat.
Verified fbcode tests compiled_optimizers

Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
2024-09-27 11:20:03 +00:00
bbff667e32 [Distributed] [13/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136713)
Follows #136528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136713
Approved by: https://github.com/kwen2501
2024-09-27 10:11:53 +00:00
48c18ff850 [dynamo] Added support for tensor's is_inference method (#136450)
Fixes #135439

This PR adds support for the `is_inference` method on torch tensors which successfully compiles the following example fn without graph breaks:
```python
def fn_simple(x):
    if x.is_inference():
        return x.sum()
    else:
        return x.min()
```

I've also tried to add guards on the tensor's `is_inference` state. I wasn't 100% sure where these should go, so please don't hesitate to correct me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136450
Approved by: https://github.com/ezyang
2024-09-27 09:15:07 +00:00
e14b58ffbd Using device-agnostic autocast api (#136613)
- using torch.autocast(device_str="cuda") instead of torch.cuda.amp.autocast()
- using torch.autocast(device_str="cpu") instead of torch.cpu.amp.autocast()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136613
Approved by: https://github.com/shink, https://github.com/cyyever, https://github.com/kwen2501
2024-09-27 07:16:24 +00:00
ad6c70b656 [PP] Remove modifications to autograd nodes in ZB (#136678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136678
Approved by: https://github.com/wconstab, https://github.com/kwen2501
ghstack dependencies: #136507, #136584
2024-09-27 07:07:58 +00:00
9529d018e9 Refactor offset logic and work for nD (#135861)
Address the following TODO tasks in the distributed test files:

- TODO: make this test cleaner and work for nD
- TODO: add comments for create_plan/TestDedupTensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135861
Approved by: https://github.com/wz337
2024-09-27 06:13:06 +00:00
69bd13d12e [EZ][BE] Add torch.complex to MPS_DTYPES (#136755)
As the minimal supported OS has been raised to macOS 13, some basic complex operations should be supported, and the rest can be `xfail`ed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136755
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754
2024-09-27 05:01:40 +00:00
73f038c5b3 Log total miss inplaced bytes (#136684)
Summary: title.

Test Plan: add tests. run existing tests.

Differential Revision: D63411459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136684
Approved by: https://github.com/zou3519
2024-09-27 04:57:57 +00:00
0200bea562 Delete grid reduction optimization that is causing specialization (#136783)
Summary:
https://fb.workplace.com/groups/1075192433118967/posts/1510513706253502

Creating a set is causing symexpr to specialize

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D63432357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136783
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-09-27 04:39:39 +00:00
a63d7cb54c add typing to _dynamo/current_scope_id.py (#136676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136676
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/Skylion007
2024-09-27 04:09:15 +00:00
5eb68d565f Revert "[inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594)"
This reverts commit 2c5f5e303a8d6fd55b6651f4d965fafaa6a540a7.

Reverted https://github.com/pytorch/pytorch/pull/136594 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136594#issuecomment-2378358302))
2024-09-27 04:06:05 +00:00
507c69e20f Don't uselessly recompute axiom dict every static eval call (#135429)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429
Approved by: https://github.com/isuruf
ghstack dependencies: #135137
2024-09-27 04:03:25 +00:00
285fa03b5e Deal with size oblivious before going into worker (#135137)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135137
Approved by: https://github.com/isuruf
2024-09-27 04:03:25 +00:00
86631eccda [Inductor] Remove stride-0 dimensions from more complex block pointers (#135557)
Related issue: #125077

### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.

This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.

Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[None,None,:]`).
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 8
    xnumel = 8
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)

This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.

Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-27 04:01:40 +00:00
2c5f5e303a [inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594)
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])
`

https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
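A minimal sketch of the difference (illustrative only; the kernel body is made up to keep it runnable on a CUDA machine with Triton installed):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _full_demo(out_ptr, XBLOCK: tl.constexpr):
    xindex = tl.arange(0, XBLOCK)
    # Old codegen: one = tl.full([1], 1.0, tl.float64) is a rank-1, one-element tensor;
    # broadcasting it against a higher-rank block can hit "Cannot broadcast, rank mismatch".
    # New codegen: an empty shape yields a true scalar, which broadcasts against any block.
    one = tl.full([], 1.0, tl.float64)
    val = one + xindex.to(tl.float64)
    tl.store(out_ptr + xindex, val.to(tl.float32))

out = torch.empty(128, device="cuda", dtype=torch.float32)
_full_demo[(1,)](out, XBLOCK=128)
print(out[:4])  # tensor([1., 2., 3., 4.], device='cuda:0')
```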

Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
2024-09-27 04:01:09 +00:00
a2d2a30311 Add torch._dynamo.config.fail_on_cache_limit_hit (#136767)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136767
Approved by: https://github.com/albanD, https://github.com/jansel
ghstack dependencies: #136533
2024-09-27 03:58:00 +00:00
2521cd3874 Skip kernel saving if already existed. (#136389)
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for a benchmark of a trivial addmm:

Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">

After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which often makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
[Redacted, Some internal model results in D63441430]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136389
Approved by: https://github.com/desertfire
2024-09-27 03:03:28 +00:00
d1382aaf3d skip test_out_of_memory for jetson (#133270)
Skip test_out_of_memory in test/test_cuda.py on Jetson as OOM reporting in Jetson has issues due to partially missing NVML support. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133270
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/seemethere
2024-09-27 02:36:48 +00:00
26869d38e1 [Inductor] Further solve missing aoti_torch_check symbol issue (#136775)
Summary: https://github.com/pytorch/pytorch/pull/136669 didn't resolve all the internal test failures. Add more tests to OSS CI to catch the remaining issues, and fix some internal TARGETS dependency.

Differential Revision: D63473744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136775
Approved by: https://github.com/henrylhtsang
2024-09-27 02:26:49 +00:00
66340e6751 Fix numerical instability for norm (#129352)
Fixes #123645
When the reduction size is large, reducing directly may exceed the range that FP32 can represent, producing incorrect results.
Reducing in groups and using double as the intermediate accumulation type avoids exceeding the representable range.
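The idea, sketched in plain PyTorch (the actual fix lives in the C++ kernel; the function name and group size here are illustrative):

```python
import torch

def group_reduced_sum(x: torch.Tensor, group_size: int = 4096) -> torch.Tensor:
    # Accumulate per-group partial sums in float64 so the running total
    # doesn't overflow the FP32 range for very large reductions.
    acc = torch.zeros((), dtype=torch.float64)
    for chunk in x.flatten().split(group_size):
        acc = acc + chunk.to(torch.float64).sum()
    return acc.to(x.dtype)
```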

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129352
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-27 00:51:31 +00:00
adc77a9b7f [lintrunner] auto apply formatting changes as suggestions (#136239)
(Remove spurious cc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136239
Approved by: https://github.com/huydhn, https://github.com/eqy

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-09-27 00:51:21 +00:00
faedee12fa [test] enable test_triton_wrapper again (#136721)
Summary:
Reenable the `test_triton_wrapper.py` test again

# Why

We want this to run internally

# What

- fix python path issue on the test
- reenable the test

# Background

It appears that the parent process does not pass its entire module search path down to the child process. Namely, if some setup makes the effective sys.path differ from, say, PYTHONPATH, the child will not inherit that setup. To avoid needing to track each specific setup, we pass the parent's effective `sys.path` to the child through the PYTHONPATH env variable.
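A hedged sketch of the idea (the child script name is hypothetical):

```python
import os
import subprocess
import sys

env = dict(os.environ)
# Hand the parent's *effective* sys.path to the child, so modules visible to the
# parent (even via runtime sys.path edits) stay importable in the subprocess.
env["PYTHONPATH"] = os.pathsep.join(sys.path)
subprocess.run([sys.executable, "child_test.py"], env=env, check=True)
```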

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:triton_wrapper

Differential Revision: D63438186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136721
Approved by: https://github.com/henrylhtsang
2024-09-27 00:44:40 +00:00
22a4129a76 Generalization of FSDP common for non-cuda execution (#133209)
## Motivation
The common code for FSDP UT execution is mostly written with the CUDA device in mind. However, other devices such as Intel Gaudi support most of the functionality. We are generalizing the base content so that the UT content can be used for non-CUDA device execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
2024-09-27 00:38:10 +00:00
a619ced5ed Revert "Update run_test.py"
This reverts commit 193073b4914a7f80758541d391eacbe21194ecdf.
2024-09-26 17:34:52 -07:00
193073b491 Update run_test.py 2024-09-26 16:56:29 -07:00
aa56f80ec1 Dont pairwise check unfusable nodes in scheduler (#136682)
Gives 8% wall time speedup on n=1000 benchmark in https://github.com/pytorch/pytorch/pull/136429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136682
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/shunting314
2024-09-26 23:46:52 +00:00
0b62ebfeaa [CI] Populate JOB_ID for MPS tests (#136791)
Move `get-job-id` steps before running the tests and copy-n-paste environment variables from `_mac-test.yml` added in https://github.com/pytorch/pytorch/pull/113099

Should fix the following warning during MPS test run:
```
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/stats/upload_metrics.py:147: UserWarning: Not emitting metrics for td_test_failure_stats_v2. Missing job_id. Please set the JOB_ID environment variable to pass in this value.
  warn(f"Not emitting metrics for {metric_name}. {e}")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136791
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
2024-09-26 23:00:52 +00:00
da5c7b6f4e [AOTI] Set CUDA device for torch._export.aot_load (#136715)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136369. When a CUDA device with index is specified when calling torch._export.aot_load, we need to specify the CUDA device when running model.so.
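A hedged usage sketch (the .so path is hypothetical):

```python
import torch

# The loaded runner should now execute on the device index given here.
runner = torch._export.aot_load("/tmp/model.so", device="cuda:1")
out = runner(torch.randn(8, 8, device="cuda:1"))
```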

Differential Revision: [D63438335](https://our.internmc.facebook.com/intern/diff/D63438335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136715
Approved by: https://github.com/angelayi
2024-09-26 22:20:12 +00:00
991f8f8ec3 Bias gradient calculation for NJT linear backward (#136660)
Previously NYI - @mikaylagawarecki needs it for Transformers.

Fixes #136652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660
Approved by: https://github.com/soulitzer
2024-09-26 21:38:10 +00:00
eqy
c0e98a485b [FP8][CUDA] Fix stale expected error message (#136581)
CC @nWEIdia as I think we have seen internal failures on this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136581
Approved by: https://github.com/mikaylagawarecki
2024-09-26 20:57:38 +00:00
5789f8d5dc [MPS] Add regression test for large inputs to F.linear (#136084)
This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13.

~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-26 20:46:14 +00:00
9656a603b2 Fix lint (#136781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136781
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet
2024-09-26 19:13:56 +00:00
c878ea2c4e Add info about "release tracker" label for cherry-picking bot (#136777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136777
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-09-26 18:45:45 +00:00
851b9732aa Download pre-compiled AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603)
PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)) docker images. Building aotriton from source can take ~45 minutes.

This PR fixes the issue by downloading the aotriton tarball in such scenarios, *unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603
Approved by: https://github.com/atalman

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
2024-09-26 18:05:51 +00:00
f0a92541fe [export] fix lifted constants order for 0-input graphs (#136658)
Summary:
With empty graphs, the `graph.inserting_before(first_user_input = None)` call turns into a `graph.inserting_after(root)` call, inverting the order of constant input nodes being inserted.

This fixes the issue by initializing to the first node in the graph (still valid if not a user input - only used for insertion).

Test Plan: test_export

Differential Revision: D63403514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136658
Approved by: https://github.com/avikchaudhuri
2024-09-26 17:44:24 +00:00
40c825d773 [reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136768
Approved by: https://github.com/kwen2501, https://github.com/atalman
2024-09-26 17:37:07 +00:00
da09984c0d [AOTI][Tooling][9/n] Add debug printer support for cpp kernel type (#136465)
Summary:

As title.

Cpp kernel has a different codegen path: https://www.internalfb.com/code/fbsource/[6df946858879dd9bcefa18710dd79095a957f0dd]/fbcode/caffe2/torch/_inductor/codegen/cpp.py?lines=4643
Previously it is not wrapped/supported by the debug printer manager. This diff adds this support.
It can be useful for cpu models. See this for a use case: https://www.internalfb.com/phabricator/paste/view/P1598561051?lines=927

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run 'fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn -- aot --batch-size 1
```

Differential Revision: D63053101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136465
Approved by: https://github.com/hl475
2024-09-26 17:30:43 +00:00
e4e83a4ac4 Remove aten.item hack (#136663)
Summary: Title

Test Plan: CI

Differential Revision: D63404353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136663
Approved by: https://github.com/bdhirsh
2024-09-26 17:14:48 +00:00
2421344d8f Update current maintainers (#136672)
This file hasn't had an overhaul in a few years, so this is long overdue. Most of the credit goes to @orionr for gathering all of this info.

The main rules we followed:
- No code contributor is removed, they're all placed as emeritus
- Breakdown too big categories to make this document useful to know who to ping
- No category where the code is still in the codebase is removed
- We did not rework the categories (for example to be closer to module: labels) and leave that for later
- All non-emeritus names are ordered by their number of comments on issues related to their topic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet
2024-09-26 17:13:16 +00:00
beb46de342 Correctly convert Python float to float64 when passing argument as Tensor (#136413)
I can't actually test the Dynamo codegen fix as it is impossible to
directly use the Tensor at the moment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #136599
2024-09-26 16:50:13 +00:00
11fd55827d Make CLOSURE_VARS construction lazy (#136599)
This makes us less likely to hit import cycle problems with torch

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136599
Approved by: https://github.com/anijain2305
2024-09-26 16:50:13 +00:00
ff2360c733 [FlexAttention] Reduce expensive test time by 10x (#136677)
Now that we support sequence lengths that are not divisible by 128, this drops the expensive tests' runtime by roughly 10x.
Before
```Shell
46.32s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1
45.61s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2
44.45s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3
43.61s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0
```

After:
```Shell
4.25s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod5
4.20s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod4
4.19s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1
4.04s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2
3.99s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0
3.98s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136677
Approved by: https://github.com/Chillee
ghstack dependencies: #136673
2024-09-26 16:40:21 +00:00
840c6b7a68 [FlexAttention] Add Better error message for cpu tensors (#136673)
Partially address: #136525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673
Approved by: https://github.com/Chillee
2024-09-26 16:40:21 +00:00
ddab704b28 Use wildcard for portion of AMI version number (#136764)
Rather than specifying a specific version number for the AMIs, use wildcards for the date section.

Issue: pytorch/pytorch#136762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136764
Approved by: https://github.com/ZainRizvi
2024-09-26 16:39:25 +00:00
cyy
59e8f8228f [3/N] Fix clang-tidy warnings in torch/csrc/lazy (#136705)
Follows #136634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136705
Approved by: https://github.com/Skylion007
2024-09-26 16:29:43 +00:00
31c0467594 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-26 15:35:26 +00:00
68579ef665 [EZ][MPS] Extend arange to bfloat16 (#136754)
RangeFactories class is the only one that uses `AT_DISPATCH_MPS_TYPES`

Fixes https://github.com/pytorch/pytorch/issues/136624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136754
Approved by: https://github.com/Skylion007
2024-09-26 15:33:45 +00:00
73ec76ed50 [MPS] Implement isposinf and isneginf (#136689)
Not sure why `isinf` is a composite op, but these need to be implemented by hand.

Implementation is a trivial call to
```objc
[mpsGraph equalWithPrimaryTensor:input
                 secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity()
                                                     dataType:input.dataType]]
```
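Hedged usage on the Python side (assuming an MPS-capable machine):

```python
import torch

x = torch.tensor([float("inf"), -float("inf"), 1.0], device="mps")
print(torch.isposinf(x))  # tensor([ True, False, False], device='mps:0')
print(torch.isneginf(x))  # tensor([False,  True, False], device='mps:0')
```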
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689
Approved by: https://github.com/Skylion007
2024-09-26 15:33:20 +00:00
d05645841e Update get_device_properties to take in optional device (#136683)
Aligns behavior with the rest of cuda's device info query methods
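A hedged sketch of the aligned behavior (the default-to-current-device behavior is assumed from this description):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties()  # device argument now optional
    same = torch.cuda.get_device_properties(torch.cuda.current_device())
    print(props.name, props.total_memory, props.multi_processor_count)
```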

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683
Approved by: https://github.com/eqy
2024-09-26 15:07:31 +00:00
d5e4a20c17 Revert "Introduce _ArglessActivation base class for parameterless activation functions (#136296)"
This reverts commit dda0e4de32b29098f25f9b2889423c9446680cc1.

Reverted https://github.com/pytorch/pytorch/pull/136296 on behalf of https://github.com/atalman due to Breaks Internal CI. Error: Too many arguments [19]: Call `nn.modules.activation._ArglessActivation.__init__` expects 0 positional arguments, 1 was provided. ([comment](https://github.com/pytorch/pytorch/pull/136296#issuecomment-2377091280))
2024-09-26 14:12:12 +00:00
4150ab44a4 Fix composite op redispatch for NJT in inference mode (#134683)
Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT.

This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #136566
2024-09-26 14:10:53 +00:00
f8debd5d83 Fix wrapper subclass reentrant dispatch + TorchDispatchMode (#136566)
Fixes #136565

This PR makes the python fallback robust to the case where there are no active modes & no tensors with the Python key. In this case, simply redispatch with the Python key disabled.

This was found when trying to use reentrant dispatch for NJT to get decompositions under `inference_mode()` when the autograd key is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136566
Approved by: https://github.com/bdhirsh
2024-09-26 14:06:51 +00:00
963e793e1b [Inductor][CPP] Optimize WOQ INT8 wgt dequant in AMX GEMM template (#136630)
**Summary**
Optimize the WOQ int8 AMX performance by changing the int8 -> bf16 conversion.
Earlier, 16 int8 elements were being loaded at a time & converted to 16 BF16 elements.
With this change, 32 int8 elements will be loaded at a time, and converted to a cache-line of 32 BF16 elements more efficiently.

Performance before
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
  cpp_packed_gemm_0 38.0439 ms 100.0%
  _weight_int8pack_mm 50.2524 ms 75.7%
SingleProcess AUTOTUNE benchmarking takes 1.1087 seconds and 1.9791 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
  cpp_packed_gemm_4 78.2038 ms 100.0%
  _weight_int8pack_mm 119.1962 ms 65.6%
SingleProcess AUTOTUNE benchmarking takes 1.9274 seconds and 1.9949 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
  cpp_packed_gemm_6 79.2368 ms 100.0%
  _weight_int8pack_mm 118.3212 ms 67.0%
SingleProcess AUTOTUNE benchmarking takes 1.9200 seconds and 2.0015 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
  cpp_packed_gemm_224 225.7201 ms 100.0%
  _weight_int8pack_mm 388.5588 ms 58.1%
```

Performance after this PR
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
  cpp_packed_gemm_0 11.0086 ms 100.0%
  _weight_int8pack_mm 50.2918 ms 21.9%
SingleProcess AUTOTUNE benchmarking takes 1.0837 seconds and 2.0301 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
  cpp_packed_gemm_4 24.3528 ms 100.0%
  _weight_int8pack_mm 119.8492 ms 20.3%
SingleProcess AUTOTUNE benchmarking takes 1.8303 seconds and 1.8195 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
  cpp_packed_gemm_6 24.6148 ms 100.0%
  _weight_int8pack_mm 119.1908 ms 20.7%
SingleProcess AUTOTUNE benchmarking takes 1.8315 seconds and 1.8352 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
  cpp_packed_gemm_224 78.1369 ms 100.0%
  _weight_int8pack_mm 387.6289 ms 20.2%
SingleProcess AUTOTUNE benchmarking takes 4.5059 seconds and 1.8010 seconds precompiling
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136630
Approved by: https://github.com/jgong5
ghstack dependencies: #136353
2024-09-26 08:41:58 +00:00
77fba0c407 [PT2][Optimus] Fix a group batch fusion corner case (#136650)
Summary:
We have a user report on a BA model that raised "AttributeError: 'SymFloat' object has no attribute 'shape'", so we add a type check for the meta node.

See more context in the post
https://fb.workplace.com/groups/1075192433118967/permalink/1510477489590457/

Test Plan:
# local reproduce

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-batch-decompose --flow_id 646303196
```

P1609807876

# E2E

before fix

f646303196

after fix

Differential Revision: D63399959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136650
Approved by: https://github.com/ezyang
2024-09-26 06:35:11 +00:00
d1bb8e828f Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492
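A hedged sketch of the user-visible effect:

```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.rand(10_000_000, device="cuda")
# With this change, CUDA cumsum routes through its decomposition under
# deterministic mode, so results are reproducible across runs.
y = torch.cumsum(x, dim=0)
```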

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-26 04:52:05 +00:00
b408591b53 Revert "[Flex Attention] fix block size order (#136657)"
This reverts commit 529b6ab0bb9f8800ed795ec8e4fa1f0e8042bb0a.

Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some test_flex_attention is failing in trunk after this change 529b6ab0bb ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2375824802))
2024-09-26 04:06:41 +00:00
cyy
3c542ce831 [Reland] Check function declarations of COREML code (#136070)
Reland of #135467 by fixing periodic workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136070
Approved by: https://github.com/ezyang
2024-09-26 03:52:06 +00:00
042af7ec53 [BE] [MPS] Use validation helper for input tensors (#134609)
Small refactor to use already existing helper with equivalent behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134609
Approved by: https://github.com/malfet
2024-09-26 03:47:30 +00:00
e4d32d2194 Improve data-dependent-output meta kernel error message (#136671)
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136671
Approved by: https://github.com/williamwen42
2024-09-26 03:46:04 +00:00
190e09d8b6 [Inductor UT] Generalize device-bias code introduced from #134874 and (#136596)
[Inductor UT] Generalize device-bias code introduced from #134874 and fix unexpected success test cases.
Fix #136595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136596
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-09-26 02:56:59 +00:00
dda0e4de32 Introduce _ArglessActivation base class for parameterless activation functions (#136296)
Fixes #133683
Fixes #133684
Fixes #133688

This PR introduces a new base class `_ArglessActivation` and refactors five existing activation functions to inherit from it. This change aims to improve documentation consistency and also API consistency with other activation functions that do have parameters and explicitly call `super().__init__()`

Key changes and considerations:
1. Added new class `_ArglessActivation`:
2. Refactored the following classes to inherit from `_ArglessActivation`:
   - Sigmoid
   - Tanh
   - Softsign
   - Tanhshrink
   - Softmax2d
3. Performance consideration:
   - This change introduces a slight overhead for creating a new stack frame and handling an additional function call on every instance creation
   - The impact is expected to be minimal in most use cases
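A minimal sketch of what the base class and one refactored module might look like (exact bodies are assumptions based on the description above):

```python
import torch.nn as nn
import torch.nn.functional as F

class _ArglessActivation(nn.Module):
    """Shared base for activation modules that take no constructor arguments."""
    def __init__(self) -> None:
        super().__init__()

class Softsign(_ArglessActivation):
    def forward(self, input):
        return F.softsign(input)
```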

Docs view before:
<img width="425" alt="Screen Shot 2024-09-18 at 3 00 22 PM" src="https://github.com/user-attachments/assets/ca0d1000-44c5-4c52-b344-68f7e170bafe">

Docs view after:
<img width="431" alt="Screen Shot 2024-09-18 at 3 00 52 PM" src="https://github.com/user-attachments/assets/f7ceb8f3-a2a2-4fd6-a2b8-39105a02bcbd">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136296
Approved by: https://github.com/mikaylagawarecki
2024-09-26 02:45:05 +00:00
d0456b4274 noop on torch.library APIs under torch::deploy (multipy) (#136645)
Fixes https://github.com/pytorch/pytorch/issues/136177

The motivation is that torch::deploy doesn't handle this well. The
workaround for users is to use C++ custom ops.

All torch.library APIs ultimately go through the torch.library.Library
object, so we add checks to noop for torch::deploy there.

Test Plan:
- new test
- going to test this internally and hope nothing breaks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136645
Approved by: https://github.com/ezyang
2024-09-26 02:34:34 +00:00
5c78c6b05a [CI] Switch aarch64 dashboard run back to nightly (#136643)
Summary: Reduce the frequency of the aarch64 dashboard CI run since we don't need to monitor its instability anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136643
Approved by: https://github.com/huydhn
2024-09-26 01:26:05 +00:00
141cae2eb8 [pipelining] Fix more leaks and check leaks in tests (#136584)
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).

This time, also add a test-time check that helped to discover new leaks and ensure we won't accidently regress.

Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.

Uses objgraph for a nice debug utility when a leak is found.

Credit to @H-Huang for pointing out objgraph and helping debug the `param_group["intermediates"]` leak.

I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.

Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:

```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```

rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507

Co-authored-by: Howard Huang <howardhuang@fb.com>
2024-09-26 01:10:40 +00:00
e8f1dd6ba0 Fix hardcoded ROCm paths in Caffe2Targets.cmake (#136283)
Fixes #131701

Use CMake imported targets more consistently to eliminate hardcode paths.

Here is the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```

```
set_target_properties(torch_hip PROPERTIES
  INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
  INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```

HIPCUB dependency was not actually used; which is why it is removed here as the imported target had undesirable side effects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-26 00:34:43 +00:00
f3dd1721f4 [Update] Update note for Getting Started with PyTorch on Intel GPUs (#129946)
Remove the hardware and software prerequisites and environment setup part.
Keep the prerequisites section and link to PyTorch Prerequisites for Intel GPUs for driver install, Intel support package install, and environment setup:
https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
Update the support for Intel Client GPU MTL-H
Update inference & training examples

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129946
Approved by: https://github.com/seemethere
2024-09-26 00:22:05 +00:00
9223c16208 Revert "Fix constant propagation in builtins and UserClasses (#131354)"
This reverts commit dd4a51b39aa02cba23b3a387b41c5026770d9220.

Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/atalman due to Breaks torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2375417145))
2024-09-25 23:01:03 +00:00
ecc15c4f89 [AOTI] Fix a missing aoti_torch_check symbol issue (#136669)
Summary: When Inductor generates cpp kernels, they should be pure cpp loops that are as independent of libtorch as possible.

Differential Revision: D63403473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136669
Approved by: https://github.com/henrylhtsang
2024-09-25 22:56:10 +00:00
b7a5c7d331 Do not XFAIL test_segfault in fbcode (#136661)
https://github.com/pytorch/pytorch/pull/136252 silenced the failure in OSS, but the test actually passed in fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661
Approved by: https://github.com/malfet
2024-09-25 22:26:24 +00:00
8d65d9f11b Constraint setuptools to 72.1.0 or older in requirements.txt (#136489)
FIXES: https://github.com/pytorch/pytorch/issues/136541

Setuptools>=74.0.0 has deprecated support for some functions in distutils, so the builds run into errors such as ```AttributeError: module 'distutils' has no attribute '_msvccompiler'```. Also, the pytorch builds have setuptools pinned to 72.1.0 according to these PRs: https://github.com/pytorch/builder/pull/1995 and 89d9a8cf6f. So, until there is a fix to change the function usage in accordance with the latest setuptools, the 72.1.0 version works fine.

Also observed in CI jobs: https://github.com/pytorch/pytorch/actions/runs/10979326524
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136489
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-25 22:06:05 +00:00
c9d12f6360 [inductor][memory] add signpost event for memory pass (#136538)
Add logging to scuba table for internal models.

For verification, I triggered a sample workflow internally and checked the scuba table logging to make sure the `Paramaters` column has the expected loggings, see [here](https://fburl.com/scuba/workflow_signpost/39h7qo9s).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136538
Approved by: https://github.com/yf225
2024-09-25 21:47:46 +00:00
b5c2a657ae Add zou3519 to CODEOWNERS for HOPs (#136679)
There are some tricky things that I want to guard against
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136679
Approved by: https://github.com/Chillee
2024-09-25 21:29:48 +00:00
289df45cee Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)" (#136590)
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverts
* https://github.com/pytorch/pytorch/pull/135503
* https://github.com/pytorch/pytorch/pull/135502
* https://github.com/pytorch/pytorch/pull/135422

This revert makes the test below pass. Earlier, the getitem would stay as a getitem in the FX graph, but now fake tensor propagation fails, saying that `.item()` is called. It seems that torch function is not getting triggered during fake tensor propagation.

```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask

torch.set_default_device('cuda')

flex_attention = torch.compile(flex_attention, dynamic=False)

prefix_lengths = torch.arange(8)
def prefix_lm(b, h, q, kv):
    return prefix_lengths[b] >= kv

mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590
Approved by: https://github.com/Chillee
2024-09-25 21:10:43 +00:00
529b6ab0bb [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-25 21:08:40 +00:00
76b044d7cb Don't actually import module when checking if its valid (#136548)
Summary: If you actually import the module, you might end up with some import cycle situation where a module is imported too early and accesses things that are not initialized yet.

Test Plan:
sandcastle and ossci

```
TORCH_LOGS=+torch._inductor.codecache buck run mode/opt caffe2/benchmarks/dynamo:torchbench
```

Differential Revision: D63330224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136548
Approved by: https://github.com/Skylion007
2024-09-25 20:47:32 +00:00
11c5f9ac3b Use amazon linux 2023 runners for Docker builds (#136544)
Migrate these builds to linux 2023. We want to build and test the Docker images in CD.

Looks like we are hitting this issue: https://github.com/docker/buildx/issues/379 when trying to build Docker on Amazon Linux 2023.

Conda Docker build is timing out. While Manywheel is executing but failing because BUILDKIT is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544

Proposed Solution is to fix it in user_data . Please see: https://github.com/pytorch/test-infra/issues/5712

I see docker builds are executed successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544

Workaround timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564 ) by configuring number of open files per container to 1048576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136544
Approved by: https://github.com/ZainRizvi

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-25 20:39:56 +00:00
13b0baf2a1 [FX] Update _inline_module util function to work with both args and kwargs (#136631)
Summary: Previously, the `_inline_module` helper function only worked with submodules that have args specified. This diff updates the util function to look for input arguments in submodule kwargs first, using placeholder node names, then fall back to the list of args if the node name is not found.

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions
```

Differential Revision: D63347675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631
Approved by: https://github.com/jfix71
2024-09-25 20:20:57 +00:00
a8ed873ba2 Add missing input "eps" to adam docs (#135191)
Minor fix for missing input argument in the Adam optimizer docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135191
Approved by: https://github.com/janeyx99
2024-09-25 20:17:23 +00:00
cyy
6aa6bd4ca5 [Distributed] [12/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136528)
Follows #136439. A dangling reference to qualifiedName was found and fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136528
Approved by: https://github.com/kwen2501
2024-09-25 20:12:08 +00:00
5a29a06aa3 [AMD][inductor] do not use float64 on AMD internally (#136441)
Summary:
Internal AMD triton seems to have issue with float64 constant:

```
### Most recent error lines found on the logs:
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]                ^
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp8 = tl.broadcast_to((libdevice.llrint((tl.full([1], 1.00000000000000, tl.float64))*(ks3.to(tl.float64)))) / ks1, [XBLOCK, RBLOCK])
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp7 = tmp5 + tmp6
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp6 = 0.5
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp5 = tmp4.to(tl.float32)
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp4 = (((r3 + (x0*((17 + (16*ks0*ks1)) // 18))) % ks2) // ks0) % ks1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp3 = tmp2.to(tl.int1)
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp2 = tmp0 < tmp1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp1 = 16*ks0*ks1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp0 = r3 + (x0*((17 + (16*ks0*ks1)) // 18))
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         r3 = rindex
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         rmask = rindex < rnumel
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         rindex = roffset + rbase
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] triton.compiler.errors.CompilationError: at 26:15:
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
```

Bisecting shows this error was introduced by D62465575

This diff avoids converting constants to float64 on AMD, and the emu1.4 predictor can now run on AMD with rocm6.0.

Test Plan:
rocm6.0 can work
```
TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE=1 HIP_FORCE_DEV_KERNARG=1 HIP_GRAPH=--use-cuda-graph PYTORCH_MIOPEN_SUGGEST_NHWC=1 TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 CUDA_VISIBLE_DEVICES="2" TORCH_LOGS="recompiles,cudagraphs" buck2 run @//mode/opt-amd-gpu -c fbcode.rocm_ck_rtz=true -m rocm60 fblearner/predictor/py/applications/photogen:ip_python_predictor_photogen_cm -- --model=photogen_v1p4_9b --thrift_server_port=15008 --max_predict_calls=1 --enable_tunable_op --load_from_torch_package=genai:937233660_1
```

emu1.4 predictor on AMD fails with rocm6.2 with some other triton errors (https://www.internalfb.com/phabricator/paste/view/P1603842354)

Differential Revision: D63263806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136441
Approved by: https://github.com/houseroad
2024-09-25 19:13:17 +00:00
37f340c1e5 [EZ] Remove remaining amz2023 runner variant references (#136540)
Validated no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)) so removing all references to it

Explicit references to the amz2023 runner type variants were removed in the following PRs:
- https://github.com/pytorch/ignite/pull/3285
- https://github.com/pytorch/ao/pull/887
- https://github.com/pytorch/fbscribelogger/pull/1
- https://github.com/pytorch/pytorch/pull/134355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-25 19:01:00 +00:00
9c2c61d2dd [inductor] ELEMENTS_PER_WARP_32 -> ONE_ELEMENT_PER_THREAD (#136472)
AMD devices have a warp (wavefront) size of 64 rather than 32; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding it as 32. It also renames the enum value. Added a unit test for this.

Note: I left the old enum option (ELEMENTS_PER_WARP_32) as is instead of renaming it. I'm not sure whether we should expect caches to get invalidated here; if this concern is valid, then there's a risk that this would get updated, but some model could still use the cached inductor code, which would reference "ELEMENTS_PER_WARP_32", which would no longer exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472
Approved by: https://github.com/jansel
2024-09-25 18:21:09 +00:00
cyy
a259fbf72c [2/N] Fix clang-tidy warnings in torch/csrc/lazy (#136634)
Follows #134655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136634
Approved by: https://github.com/Skylion007
2024-09-25 18:08:29 +00:00
0b38fa154a Fix meta registry in export (#136492)
Summary: Title

Test Plan: CI

This fixes some breaking tests in executorch. I think the root cause is that when we have aten::matmul, which we are not preserving, we register the meta implementation from the C++ side. It seems like the C++ kernel doesn't work well with a mix of FakeTensor and real tensors. This PR sidesteps the problem by always preferring the Python CIA decomp over the C++ CIA decomp.

Differential Revision: D63297050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492
Approved by: https://github.com/bdhirsh
2024-09-25 17:53:02 +00:00
8582835499 [ONNX] Remove the operators test (#136335)
The tests are obsolete and hard to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335
Approved by: https://github.com/xadupre, https://github.com/cyyever

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-25 17:44:18 +00:00
7cb6d31567 Dump partially traced make_fx graph in event of error to tlparse (#136508)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136508
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/malfet
ghstack dependencies: #136533
2024-09-25 17:44:15 +00:00
9409274bc1 Fix bug in functional tensor decomp (#136600)
Summary: Previously we had a very bad bug where we didn't allow any decomp on CIA. This never mattered before because we never had to actually push CIA decomps to the Python key level in export.

Test Plan: CI

Differential Revision: D63363749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136600
Approved by: https://github.com/bdhirsh
2024-09-25 17:37:50 +00:00
5d7ed02f52 [user-written triton kernels] specialize exprs if they are expected to be tl.constexpr (#136512)
Fixes #136504

If you have a tl.constexpr parameter to a triton kernel, and you pass in a SymNode, then, right now, you run into failures (see under 'constants'):

```
  File "/tmp/torchinductor_dberard/na/cnax67r5zmslz7bvdfizteaepj7fajpjallb3bu2gyetjcdqtbzj.py", line 14, in <module>
    triton_meta={'signature': {0: '*fp32', 1: '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, cc=90, major=9, regs_per_multiprocessor=65536, max_threads_per_multi_processor=2048, multi_processor_count=132, warp_size=32), 'constants': {2: s0, 3: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NameError: name 's0' is not defined
```

To fix this, we specialize on the value during dynamo tracing, so that we have a real integer when we do codegen.

Alternatives: specialize somewhere else (e.g. inductor); or figure out how to actually pass the value dynamically into the user-written kernel. However, if we try to pass a dynamic value, then we wouldn't be able to precompile the triton kernels in inductor or use AOTI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136512
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/eellison
2024-09-25 17:12:11 +00:00
7c6d543a5b [export] fix _get_non_persistent_buffers for duplicates (#136552)
Summary: Export's method _get_non_persistent_buffers doesn't check duplicate submodules, so we run into state_dict related issues if non-persistent buffers exist on shared submodules.

Test Plan: test_export

Differential Revision: D63332976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136552
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2024-09-25 16:46:31 +00:00
aa80b82cea [hygiene] Delete dead alerting code (#136583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136583
Approved by: https://github.com/clee2000
2024-09-25 15:44:46 +00:00
0232278b33 Fix comment posting permissions for check-labels.yml (#136610)
Currently it fails with

Error fetching https://api.github.com/repos/pytorch/pytorch/issues/136607/comments HTTP Error 403: Forbidden

(see https://github.com/pytorch/pytorch/actions/runs/11026434368/job/30622960113?pr=136607)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136610
Approved by: https://github.com/malfet
2024-09-25 15:43:19 +00:00
34711fe8c9 Fix test_skip_data_serialization pickle exception match (#136617)
The test is failing in trunk atm with the following error:

```
test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False - AssertionError: "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'" does not match "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'"
```

for example, 36f0e61166

This comes from this cpython commit a3076c734d, and manifests in python 3.12.5 currently used in CI.  The failure doesn't happen when I try it out with 3.12.3 and 3.12.4.  Looking at the commit logs of https://github.com/python/cpython/commits/main/Lib/pickle.py, it looks like the exception message is changing back and forth, so I guess a regex match would capture both.
2024-09-25 08:35:46 -07:00
deb820602a viable/strict update: log push to s3 (#136470)
As stated in https://github.com/pytorch/test-infra/pull/5686, I cannot figure out a way to determine the push time from webhooks (other than when the webhook was sent, but that isn't super accurate either).  Instead, manually save a json file to s3 that contains information for the sha and date so that we can still get this information.

Relies on https://github.com/pytorch/test-infra/pull/5690

tested in https://github.com/pytorch/pytorch/pull/136387 (but I squashed so it's kinda hard to find now)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136470
Approved by: https://github.com/huydhn
2024-09-25 15:28:53 +00:00
e3b89ca124 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit b1a02bf70824a4802411ddd5be1d3610e7a2e269.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2374201626))
2024-09-25 14:11:01 +00:00
20a855bf01 [AOTI] Move stack_allocation logic from PythonWrapperCodegen (#136463)
Summary: Move stack_allocation logic from PythonWrapperCodegen to CppWrapperCpuArrayRef

Differential Revision: [D63319970](https://our.internmc.facebook.com/intern/diff/D63319970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136463
Approved by: https://github.com/chenyang78
ghstack dependencies: #136062, #136461, #136462
2024-09-25 14:06:33 +00:00
5171b0e3c6 Revert "[ONNX] Remove the operators test (#136335)"
This reverts commit 9629835b1ccce8e72fc93bf95be13e3d53cb4871.

Reverted https://github.com/pytorch/pytorch/pull/136335 on behalf of https://github.com/ezyang due to I'll reland this, bear with me ([comment](https://github.com/pytorch/pytorch/pull/136335#issuecomment-2374183435))
2024-09-25 14:06:03 +00:00
070952aca5 [AOTI] Move stack_allocation logic from CppWrapperCpu (#136462)
Summary: Move stack_allocation logic from CppWrapperCpu to CppWrapperCpuArrayRef

Differential Revision: [D63300359](https://our.internmc.facebook.com/intern/diff/D63300359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136462
Approved by: https://github.com/chenyang78
ghstack dependencies: #136062, #136461
2024-09-25 14:03:03 +00:00
5ad5f40283 [AOTI][reland] Create another wrapper class to handle ArrayRef (#136461)
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.

Test Plan: CI

Differential Revision: [D63300361](https://our.internmc.facebook.com/intern/diff/D63300361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136461
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #136062
2024-09-25 14:00:09 +00:00
25ab87c09b Add lint rule META_NO_CREATE_UNBACKED (#135870)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135870
Approved by: https://github.com/albanD
2024-09-25 13:33:56 +00:00
dd4a51b39a Fix constant propagation in builtins and UserClasses (#131354)
* Fixes https://github.com/pytorch/pytorch/issues/118675
* Replaces https://github.com/pytorch/pytorch/pull/118994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-25 13:03:40 +00:00
a0c76ea853 Make test_skip_data_serialization regex more flexible (#136580)
Some CI machines seem to throw "Can't get local object" rather than
"Can't pickle local object".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136580
Approved by: https://github.com/mikaylagawarecki
2024-09-25 11:27:23 +00:00
370c1c4297 [aotd] Fix rrelu compilation (#136008)
Issues:
https://github.com/pytorch/pytorch/issues/135083
https://github.com/pytorch/pytorch/issues/120292

The rrelu decomposition contains a mutation (copy_). Decompositions are executed below Functionalization, so AOT produces a non-functional graph.

That decomposition is also registered as a python_dispatch kernel for AutogradCUDA.
Autograd dispatch happens above Functionalization, so registering it for Autograd (covering all backends) makes functionalization run after it.
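
A minimal repro sketch of the kind of case affected (shapes and values are assumptions; the real coverage is the test below):

```
import torch

def f(x):
    # rrelu in training mode exercises the decomposition that contains copy_
    return torch.nn.functional.rrelu(x, training=True)

x = torch.randn(4, 4, requires_grad=True)
out = torch.compile(f)(x)
out.sum().backward()
```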

Testing:
```
python test/functorch/test_aotdispatch.py -k test_rrelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008
Approved by: https://github.com/bdhirsh
2024-09-25 11:26:19 +00:00
c3fdf587b5 [inductor] [cpp] fix the check of template_buffer_has_other_users if no epilogue_nodes (#136518)
The `template_buffer_has_other_users` function checks the case where there are epilogue nodes and the template output has users other than these epilogue nodes. When there are no epilogue nodes, the function can return `False` directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136518
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418
2024-09-25 10:25:07 +00:00
cabfbef6cf [pytorch][PR] [inductor] More fixes on the keys of constants and signature dictionaries (#136514)
Summary: The previous PR forgot to change two other places that also create `constants` and `signature`.

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
 {F1884584338}

Differential Revision: D63027728

Pulled By: Myrthan

Co-authored-by: Jokeren <robinho364@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514
Approved by: https://github.com/jansel

Co-authored-by: Jokeren <robinho364@gmail.com>
2024-09-25 09:34:14 +00:00
2e30c160ef [inductor] [cpp] fix max-autotune for single-thread dynamic shapes (#136418)
Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for the single-thread dynamic shapes case:

```
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
  279 |         constexpr int64_t m_block_end = Mr_blocks;
      |                                         ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
  237 |     const int64_t Mr_blocks = (M + Mr - 1) / Mr;
      |                   ^~~~~~~~~
```

The PR also updates the UT to add a test for `BS`=512 in single thread.
The previous case had `BS`=1024, equal to the `K` and `N` values, so the generated code did not have symbolic shapes and thus failed to capture the above issue.
By adding a case with `BS`=512, the generated code has a symbolic shape for the M dim and is able to reproduce the issue that this PR is addressing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418
Approved by: https://github.com/jgong5
2024-09-25 09:24:05 +00:00
a0a1873148 [Inductor] Fix Triton tests after updating pybind11 to 2.13.6 (#136280)
https://github.com/pytorch/pytorch/pull/136087 updated pybind11 to 2.13.6, and that new release adds [a new function](https://pybind11.readthedocs.io/en/latest/changelog.html#version-2-13-6-september-13-2024) `_pybind11_conduit_v1_`. The presence of this function breaks the serialization mechanisms used by Triton and in PyTorch itself.

Possible errors that have been noticed due to this change:

<details>
<summary> the first error </summary>

```bash
_________ KernelTests.test_layout_constraint_needs_fixed_stride_order __________
Traceback (most recent call last):
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1072, in test_layout_constraint_needs_fixed_stride_order
    eager_out = f(x)
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1068, in f
    arange_out(x, y)
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1059, in arange_out
    kernel[grid](x, out, n_elements, BLOCK_SIZE=4)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 657, in run
    kernel = self.compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/compiler/compiler.py", line 315, in compile
    metadata_group[metadata_filename] = fn_cache_manager.put(json.dumps(metadata, default=vars), metadata_filename,
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
TypeError: vars() argument must have __dict__ attribute
```
</details>

<details>
<summary> the second error </summary>

```bash
________________ TestTritonWrapper.test_wrapper_using_gpu_seed _________________
Traceback (most recent call last):
  File "/cache/pytorch-c5e9d03a2da4b93481737594cbe2f5931fa569aa833f206a638189cad2c36d3c-11/test/inductor/test_triton_wrapper.py", line 40, in test_wrapper_using_gpu_seed
    out = f(x, y)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1292, in __call__
    return self._torchdynamo_orig_callable(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1087, in __call__
    result = self._inner_convert(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 530, in __call__
    return _compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 933, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 675, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
    return function(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 708, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 220, in _fn
    return fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 643, in transform
    tracer.run()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2776, in run
    super().run()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 979, in run
    while self.step():
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 891, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2967, in RETURN_VALUE
    self._return(inst)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2952, in _return
    self.output.compile_subgraph(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
    return self._call_user_compiler(gm)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/__init__.py", line 2235, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1528, in compile_fx
    return aot_autograd(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
    compiled_fn = dispatch_and_compile()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1357, in fw_compiler_base
    return _fw_compiler_base(model, example_inputs, is_inference)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1428, in _fw_compiler_base
    return inner_compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 479, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 665, in _compile_fx_inner
    compiled_graph = FxGraphCache.load(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1341, in load
    compiled_graph = compile_fx_fn(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 574, in codegen_and_compile
    compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 882, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1952, in compile_to_fn
    return self.compile_to_module().call
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1878, in compile_to_module
    return self._compile_to_module()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1906, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/tmps59zkbew/kg/ckgkb4gt5fs5pll4o7fqawppsmdezu5h52cq6nmrvi3yy6j7ddq4.py", line 45, in <module>
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 198, in triton
    kernel = TritonCodeCache.load(kernel_name, source_code)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2916, in load
    return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2853, in load
    return cls.load_by_key_path(key, path, linemap, attrs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 39, in _reload_python_module
    raise RuntimeError(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to import /tmp/tmps59zkbew/g3/cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py
SyntaxError: invalid syntax (cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py, line 14)
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136280
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang

Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>
2024-09-25 08:09:46 +00:00
1cb265fafa [AILab][attempt2] Add TryExcept when decoding healthcheck port (#136574)
Summary:
## Context
The first attempt had a lint error in OSS https://hud.pytorch.org/pr/pytorch/pytorch/136438#30553902641
{F1886895223}
## This Diff
Fix the error by wrapping the healthcheck port decoding in a try/except.
Error Message:
```
  File "/packages/aps_models.examples.dlrm.lite/dlrm_train_app-inplace#link-tree/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 224, in _setup_healthcheck
    port=int(healthcheck_port),
ValueError: invalid literal for int() with base 10: \'%port.thrift%\'
```
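
A rough sketch of the guard this adds (function and variable names are assumptions, not the actual code):

```
def _parse_healthcheck_port(healthcheck_port):
    # Placeholders such as '%port.thrift%' should not crash agent setup.
    try:
        return int(healthcheck_port)
    except ValueError:
        print(f"Invalid healthcheck port {healthcheck_port!r}; skipping healthcheck setup")
        return None
```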

Test Plan:
```
arc lint
```

Reviewed By: felixsu2006

Differential Revision: D63343041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136574
Approved by: https://github.com/atalman
2024-09-25 04:43:51 +00:00
561cd5a0a6 [BE] Use C++17 convenience methods in CUDA kernels (#136575)
- `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>`
- `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>`, and so on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136575
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-25 04:30:01 +00:00
5340feb8aa Disable iOS workflow (#136571)
See https://github.com/pytorch/pytorch/issues/136284
It's been broken for more than a week and it does not seem like anyone cares about fixing it.
Once it's landed I'll reassign the issue to `oncall: mobile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136571
Approved by: https://github.com/huydhn, https://github.com/kit1980
2024-09-25 04:29:34 +00:00
1c9a1a2a19 [AOTI] Support MKL linear ops in cpp wrapper (#134974)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support mkl linear in the ABI-compatible mode for cpp-wrapper Inductor.

Differential Revision: [D63322202](https://our.internmc.facebook.com/intern/diff/D63322202)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134974
Approved by: https://github.com/chenyang78, https://github.com/leslie-fang-intel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-09-25 03:53:11 +00:00
0200ad3457 Turn on unique kernel names (#136503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136503
Approved by: https://github.com/ezyang, https://github.com/eellison
ghstack dependencies: #136509
2024-09-25 03:39:45 +00:00
482fe186b9 Add ROCm documentation to libtorch (C++) reST. (#136378)
Fixes #126640

Added ROCm support section to libtorch (C++) reST.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136378
Approved by: https://github.com/ezyang
2024-09-25 02:30:56 +00:00
3c7edf1ec0 [Inductor][CPP] Fix int8 cvt half (#136353)
Fix the correctness issue of https://github.com/pytorch/ao/pull/884/. The current implementation for converting between `Half/BFloat16` and `int8/uint8` incorrectly assumes that 1/4 of the int8/uint8 vector lane maps to 1/2 of the Half/BFloat16 vector lane. This assumption leads to accuracy issues after the full bit-width vectorization of the Half data type was introduced. When converting between int8 weights and the half data type, the generated code is as follows:
```
#include "/tmp/torchinductor_leslie/xw/cxww3s7wxrujoyxna7mlcjktid2uu6nntixqwm542xfkd756gl3x.h"
extern "C"  void kernel(const int8_t* in_ptr0,
                       half* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2048L); x0+=static_cast<int64_t>(32L))
        {
            auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
            auto tmp1 = at::vec::convert<half>(tmp0);
            tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
        }
    }
}
```

In this PR, we address the issue by changing the implementation to convert 1/2 of the int8/uint8 vector lane into a full vector lane of Half/BFloat16.

**TestPlan**
* AO: `python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api`
* `python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_convert_int8_to_half_vec`
* Due to the CPP backend legalization pass, we are unable to create a unit test to simulate the conversion from `Half` to `int8`. Instead, we rely on a C++ test case.
  * `./build/bin/vec_test_all_types_AVX512 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`
  * `./build/bin/vec_test_all_types_AVX2 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136353
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-09-25 02:23:43 +00:00
eqy
8225e7706e [CUDA][Expandable Segments] Account for non-gc'able memory in expandable segments tests (#136496)
Seems like some other tests are holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests, while working in isolation, fail when run as e.g. `python test/test_cuda.py -k able`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496
Approved by: https://github.com/ezyang
2024-09-25 01:14:45 +00:00
5233b5a448 Update PyTorch/XLA CI image to Python 3.10 (#135278)
The old image used Python 3.8. Corresponding XLA PR: https://github.com/pytorch/xla/pull/7953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135278
Approved by: https://github.com/JackCaoG, https://github.com/atalman
2024-09-25 00:53:39 +00:00
eqy
670d64a802 [SDPA][Nested Tensor] Bump grad_query fudge factor for small GPUs (#135715)
Similar to #135711, here we see a ~1/1000 mismatch with absolute value ~0.0016 when 0.001 is allowed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135715
Approved by: https://github.com/drisspg
2024-09-25 00:36:10 +00:00
8f2a4cc4b1 Tune bsr_dense_addmm for int8 inputs on A100 (#136088)
As in the title. The tuning is done for dimensions 1280 and 5120 that are used in Vit-H.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136088
Approved by: https://github.com/cpuhrsch
2024-09-25 00:24:12 +00:00
9629835b1c [ONNX] Remove the operators test (#136335)
The tests are obsolete and hard to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335
Approved by: https://github.com/xadupre
2024-09-24 23:08:48 +00:00
b57d67e263 Add isuruf to core reviewers (#136554)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136554
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-09-24 23:06:46 +00:00
210b136c07 [export] Add experimental swap API (#136190)
Prototyped the following API which takes in an ExportedProgram, a dictionary of fqn to modules to swap, and returns a (unlifted) GraphModule
```
_swap_modules(
    ep: ExportedProgram, modules_to_swap: Dict[str, torch.nn.Module]
) -> torch.fx.GraphModule:
```
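
A hypothetical usage sketch (the model, the FQN, and the replacement module are made up; the import location of `_swap_modules` is not given in this message, so it is elided here):

```
import torch

class Small(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

ep = torch.export.export(Small(), (torch.randn(2, 4),))
# _swap_modules is the API described above; it returns an unlifted GraphModule.
gm = _swap_modules(ep, {"linear": torch.nn.Linear(4, 4)})
out = gm(torch.randn(2, 4))
```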

Differential Revision: [D62879819](https://our.internmc.facebook.com/intern/diff/D62879819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136190
Approved by: https://github.com/avikchaudhuri
2024-09-24 22:50:44 +00:00
706eda5cd8 Revert "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)"
This reverts commit 5033a1ca0dd22dae34a8939add33dbebfe0fd31d.

Reverted https://github.com/pytorch/pytorch/pull/135957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135957#issuecomment-2372493186))
2024-09-24 22:24:26 +00:00
ae80bce496 [dynamo] refactor resume_execution.py to use bytecode templates (#136483)
Use bytecode from template instead of hardcoding bytecode in resume_execution.py. Gets rid of a lot of Python-version dependent bytecode generation. Also makes resume_execution.py easier to support in future Python version updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136483
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-24 22:20:26 +00:00
36f0e61166 [BE] Use nested namespace in ATen/native/cuda (#136570)
It's a nice C++17 feature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136570
Approved by: https://github.com/Skylion007
2024-09-24 22:19:10 +00:00
1d3af68202 [ROCm] install_miopen.sh exit for ROCm >= 6.3 (#136436)
Follow up to #132555.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136436
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/atalman
2024-09-24 22:15:26 +00:00
780f4debdb [ONNX] Remove _optimize_graph from public init (#136279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136279
Approved by: https://github.com/xadupre
ghstack dependencies: #136281
2024-09-24 22:00:55 +00:00
00bc17555a Don't try to evaluate sympy.Eq in replacement; we knew this wouldn't simplify since we are here (#136533)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136533
Approved by: https://github.com/isuruf, https://github.com/pianpwk
2024-09-24 21:52:25 +00:00
b1a02bf708 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492
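
A small usage sketch of the behavior described (assumes a CUDA device is available):

```
import torch

torch.use_deterministic_algorithms(True)
x = torch.randn(1024, device="cuda")
# With deterministic algorithms enabled, CUDA cumsum now goes through its
# decomposition rather than the non-deterministic native kernel.
y = torch.cumsum(x, dim=0)
```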

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-24 21:34:43 +00:00
0133fbcfe7 Revert "Correctly convert Python float to float64 when passing argument as Tensor (#136413)"
This reverts commit f0f79dd8f1df6cf6342c9c23ae3a9be0f74eb9f5.

Reverted https://github.com/pytorch/pytorch/pull/136413 on behalf of https://github.com/ezyang due to forward fix is stuck, revert this ([comment](https://github.com/pytorch/pytorch/pull/136413#issuecomment-2372404873))
2024-09-24 21:20:37 +00:00
95c0f7493f [Inductor] Rename WrapperCodeGen to PythonWrapperCodegen (#136062)
Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit.

Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-09-24 21:02:51 +00:00
da1560c49f [SymmetricMemory] add support for cuStreamWriteValue32 (#136488)
cuStreamWriteValue efficiently combines the issuing of a system-level fence with the update of a single memory location. It is highly suitable for inter-stream progress sharing (e.g., all_gather_with_progress).

Exposing it via SymmetricMemory allows users to more easily implement efficient progress-aware matmuls in triton ([xformers example](https://github.com/facebookresearch/xformers/blob/main/xformers/ops/_triton/sequence_parallel_fused_kernels.py)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136488
Approved by: https://github.com/eqy, https://github.com/Chillee
2024-09-24 20:56:29 +00:00
7c777dd587 [ONNX] Unify ONNXProgram and remove the old one (#136281)
## Note

`test_fx_to_onnx_with_onnxruntime.py` is removed for now (it has a lot of xfails anyways). A better version will be added back.

Fixes https://github.com/pytorch/pytorch/issues/136274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136281
Approved by: https://github.com/xadupre, https://github.com/albanD
2024-09-24 20:52:19 +00:00
dbc3356655 [pipelining] fix py ref cycle in stage_backward (#136507)
TLDR; found forward activation tensors were being kept alive "forever"
(or until GC ran), and tracked it down to a cycle involving
`stage_backward.<locals>.extract_tensors_with_grads`.

The reference cycle in question is below.  (constructed using gc.get_referrers after doing a gc.collect in gc debug mode)

tensor is kept alive by
`[(<class 'cell'>, '0x7f7360234400')]`

tuple of cell objects
`(<cell at 0x7f73602343d0: function object at 0x7f734fff0ee0>, <cell at 0x7f7360234400: list object at 0x7f734e4d9a80>, <cell at 0x7f73602a4190: list object at 0x7f734eff8b00>)`
is kept alive by
`[(<class 'function'>, '0x7f734fff0ee0')]`

`<function stage_backward.<locals>.extract_tensors_with_grads at 0x7f734fff0ee0>`
is kept alive by
`[(<class 'cell'>, '0x7f73602343d0')]`

Put into more plain terms,

```

def stage_backward(...):
    ...
    stage_output_tensors = []

    # a cell object will exist that contains the variables defined in stage_backward and used by
    # both stage_backward and nested functions
    # in this case, the cell object contains 'stage_output_tensors' (and, as shown below, extract_tensors_with_grads itself)

    # this function object will hold a reference to a 'cell' that contains any vars from
    # the parent scope not explicitly passed into the function as args.
    def extract_tensors_with_grads(...):
        ...
            # extract_tensors_with_grads refers to stage_output_tensors, so stage_output_tensors
            # is in the cell
            stage_output_tensors.append(output_val)
        ...
            # but extract_tensors_with_grads ALSO refers to itself (extract_tensors_with_grads),
            # so `extract_tensors_with_grads` will be in the cell
            extract_tensors_with_grads(...)
```
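
A standalone illustration of the cycle pattern described above (unrelated to the pipelining code itself): a nested function that refers to itself keeps its enclosing cell, and everything captured in it, alive until the cycle collector runs.

```
import gc

def outer():
    big = [0] * 1_000_000          # stands in for the captured activation tensors

    def helper(n):
        if n > 0:
            helper(n - 1)          # helper captures itself via the enclosing cell
        return big                 # and also captures `big` in the same cell

    return helper

h = outer()
del h                              # refcounting alone cannot free the function/cell cycle
gc.collect()                       # only the cycle collector reclaims `big`
```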

More debug details:
https://docs.google.com/document/d/1QPH1Lz0tnieIFPM2tyHrjVB-bjlnHuDgjx1p2am3cmE/edit?usp=sharing

In pdb:
```
gc.collect()
g = gc.garbage
g[-1]
[rank0]:(Pdb) [rank0]:<function
stage_backward.<locals>.extract_tensors_with_grads at 0x7fee5c3392d0>
g[-2]
[rank0]:(Pdb) [rank0]:(<cell at 0x7fee7abbcf40: function object at
0x7fee5c3392d0>, <cell at 0x7fee7abbcf70: list object at
0x7fee7ab68940>, <cell at 0x7fee5c3210c0: list object at 0x7fee5e1
d6340>)
g[-3]
[rank0]:(Pdb) [rank0]:[tensor([[[-4.1127e-06, -3.3826e-06,  2.6226e-06,
...,  6.4969e-06,
[rank0]:          -4.4405e-06, -4.7684e-06],
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136507
Approved by: https://github.com/awgu, https://github.com/kwen2501
2024-09-24 20:46:37 +00:00
7ff8e66140 Fix flexattention sympy expr printer issue (#136509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136509
Approved by: https://github.com/yanboliang
2024-09-24 20:10:29 +00:00
02ef5dd327 [inductor][test] Check if mkl dnn bf16 is supported when using bf16 (#136290)
Sometimes the test is run on an older CPU, e.g. Intel(R) Xeon(R) CPU E5-2680 v4. If we inspect its `lscpu` flags, we don't see `avx512_bf16`, which probably means bf16 is not supported on that hardware, so the unit test can fail. We therefore add the check in the code.
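
A sketch of the kind of guard added (whether the test uses this exact query helper is an assumption):

```
import unittest
import torch

# One way to query mkldnn bf16 support on the current CPU.
bf16_supported = torch.ops.mkldnn._is_mkldnn_bf16_supported()

@unittest.skipUnless(bf16_supported, "CPU lacks avx512_bf16 / mkldnn bf16 support")
class Bf16Test(unittest.TestCase):
    def test_something(self):
        pass
```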

Context: https://github.com/pytorch/pytorch/pull/135038

Differential Revision: D62984129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136290
Approved by: https://github.com/XuehaiPan, https://github.com/chenyang78
2024-09-24 19:32:48 +00:00
888744bd36 NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021)
Related: #132695

This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.

This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.
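
A rough sketch of the bool workaround described above (the conversion entry point is abstracted away as a callable; names are placeholders):

```
import torch

def convert_with_bool_support(values: torch.Tensor, convert_fn):
    # The underlying CUDA kernels do not handle integer/bool inputs, so bool
    # values take a round trip through half precision.
    if values.dtype == torch.bool:
        return convert_fn(values.to(torch.half)).to(torch.bool)
    return convert_fn(values)
```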

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
2024-09-24 19:11:49 +00:00
8ecc5f1a8f [TorchScript][tensorexpr] imbue locale for IRPrinter (#136458)
We had an internal report where the NNC-generated CUDA code had thousands separators in integer literals. Although I wasn't able to cleanly repro, I did come up with a hacky repro and verified that this fix works (see #136459).

Differential Revision: [D63278771](https://our.internmc.facebook.com/intern/diff/D63278771)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136458
Approved by: https://github.com/eellison
2024-09-24 19:00:57 +00:00
c6192f32f1 [MPS] Add upsample_bicubic2d as Metal op (#136123)
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
 - Switch from 2D dispatch to 1D one (to match CUDA behavior)
 - Added batch + channel loops
 - Fixed scale computation to match align corners behavior
 - Added backward implementation

The backward implementation again mimics CUDA, so it has precision issues for `torch.half`, as well as a somewhat slow simulation of atomic adds using atomic compare-and-exchange of the pair of adjacent values, i.e.
```metal
template <typename T>
static inline void atomic_add_helper(
    device atomic<int>* data,
    long offset,
    float value) {
  auto ptr = data + (offset >> 1);
  auto old = atomic_load_explicit(ptr, memory_order_relaxed);
  union {
    int i;
    T t[2];
  } val;
  do {
    val.i = old;
    val.t[offset & 1] += static_cast<T>(value);
  } while (!atomic_compare_exchange_weak_explicit(
      ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
2024-09-24 18:58:11 +00:00
dacf0c4884 [dynamo] Do not treat user defined nn module attributes static for dynamic shape infra (#136516)
Fixes https://github.com/pytorch/pytorch/issues/136254

The regression was introduced in https://github.com/pytorch/pytorch/pull/132736, where originally we were trying to fix another regression. This PR and the offending PR together say - "treat user defined nn module attributes as automatic dynamic, but for cudagraphs they will be considered static". This avoids recompilations. It can lead to a cudagraph recording, which is OK. It also maintains the state before the inline_inbuilt_nn_modules flag was introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136516
Approved by: https://github.com/williamwen42
2024-09-24 18:26:12 +00:00
1028cedf71 [inductor] Enable parallel compile by default in fbcode (#136246)
Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. There is some jankiness so we can avoid evaluating the justknob at import.

Test Plan: Ran codecache tests with JK on, then canaried locally with JK off

Differential Revision: D62913998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246
Approved by: https://github.com/eellison
2024-09-24 18:10:01 +00:00
9abdc62065 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-24 17:23:09 +00:00
efed357ef5 Add dtypes support in opinfo for Intel Gaudi (#132840)
## Motivation
This is a follow-up to the changes introduced in https://github.com/pytorch/pytorch/pull/128584: we add the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840
Approved by: https://github.com/albanD
2024-09-24 17:17:15 +00:00
064093a4d6 Revert "Increase update_hint_regression problem size to 1000 (#136434)"
This reverts commit 3116fbda0fcf9af0c3dfe1280fb7e05e30e6ad5f.

Reverted https://github.com/pytorch/pytorch/pull/136434 on behalf of https://github.com/ezyang due to whoops, this is too slow ([comment](https://github.com/pytorch/pytorch/pull/136434#issuecomment-2371847842))
2024-09-24 17:05:20 +00:00
ebfcbe0822 Move print_export_warning so lru_cache works (#136491)
Summary:
as title

move print_export_warning() out of the function so `lru_cache` actually works

Test Plan: CI

Differential Revision: D63297083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136491
Approved by: https://github.com/pianpwk
2024-09-24 16:52:22 +00:00
44ec706789 add tolerance changes for test_sdpa_autocast in test_nestedtensor.py (#136485)
Upstreaming minor unit test fix from nvidia internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136485
Approved by: https://github.com/soulitzer
2024-09-24 16:31:32 +00:00
eac04fe72a Increase bf32 tolerances for some cdist tests in test_torch (#136315)
- Set the new tolerances ~= N * eps(bfloat16), which should be a comfortable upper bound, where N is the inner dimension of the matmul.

Logic behind choice of tolerance:

The maximum error of the summation of a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)`. I confirmed by sampling different random seeds that the maximum observed error doesn't exceed this value and is usually much less.
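
For reference, a quick way to compute that bound (the value of N below is only illustrative; in each test it is the inner dimension of the matmul):

```
import torch

eps_bf16 = torch.finfo(torch.bfloat16).eps   # 2**-7 == 0.0078125
N = 64                                        # illustrative inner dimension
tol = N * eps_bf16                            # proposed upper bound on the absolute error
print(tol)                                    # 0.5
```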

Fixes test failures on Arm® Neoverse™ V1 ( not raised as an issue as this hardware type is not currently covered by linux-aarch64 workflow )

```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large
    self.assertEqual(expected, actual)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 134118 / 1000000 (13.4%)
Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed)
Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed)
```

@malfet @jondea

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315
Approved by: https://github.com/albanD
2024-09-24 16:10:11 +00:00
0b667c073e Disable compiled autograd for re-entrant autograd (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-24 15:09:16 +00:00
33e10803c8 Fix ut in internal distributed_test.py (#136251)
The test case **test_new_subgroups_by_enumeration_input_rank_exceeds_world_size** failed for me and passes with this small change. The expected exception should be "ValueError" rather than "RuntimeError" according to the [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190).
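
A sketch of what the corrected expectation looks like inside the test (ranks are illustrative):

```
import torch.distributed as dist

def check_rank_bounds(self, world_size):
    # A listed rank >= world_size raises ValueError, so that is what the test
    # should expect (not RuntimeError).
    with self.assertRaises(ValueError):
        dist.new_subgroups_by_enumeration(ranks_per_subgroup_list=[[0, 1], [world_size]])
```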

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251
Approved by: https://github.com/kwen2501
2024-09-24 15:06:20 +00:00
58274e4655 Remove onnx imports in dynamo (#136334)
Remove imports of the ``torch.onnx.operators`` module in dynamo. Since ONNX depends on dynamo, this import line causes a circular dependency. Judging from the source they are not actually needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136334
Approved by: https://github.com/xadupre, https://github.com/jansel, https://github.com/titaiwangms
2024-09-24 14:54:23 +00:00
2a178a6982 Avoid changing FTZ/DAZ flags in CPP builder (#136466)
Fixes https://github.com/pytorch/pytorch/issues/136273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136466
Approved by: https://github.com/ezyang
2024-09-24 14:39:17 +00:00
6300eb1dc7 tf32 off for test_noncontiguous_samples in test_ops.py (#136484)
Upstreaming minor unit test fix from nvidia internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136484
Approved by: https://github.com/soulitzer
2024-09-24 14:26:47 +00:00
47ebb5856e Make avoid_device_init() aware of hpu device (#136194)
Added hpu to devices handled by avoid_device_init() in FakeTensorMode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136194
Approved by: https://github.com/eellison
2024-09-24 14:13:45 +00:00
54fc4f56ff [Docs fix] fix syntax error in docs :torch.blackman_window (#136354)
Fixes #ISSUE_NUMBER
https://pytorch.org/docs/stable/generated/torch.blackman_window.html

Error: the expression `torch.blackman_window(L + 1, periodic=False)[:-1])` has an extra trailing `)`; the last `)` should be deleted.
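
For reference, the corrected identity from that docs page, written out as a check:

```
import torch

L = 10
periodic = torch.blackman_window(L, periodic=True)
# Equivalent to dropping the last sample of the symmetric window of length L + 1:
symmetric_trimmed = torch.blackman_window(L + 1, periodic=False)[:-1]
assert torch.allclose(periodic, symmetric_trimmed)
```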

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136354
Approved by: https://github.com/soulitzer
2024-09-24 14:00:26 +00:00
9fc721d22b Add cache logs + other minor caching cleanup (#136456)
Summary:
- Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache.
- Split REMOTE_CACHE_VERSION - it was used for both JKs fx_graph_memcache_version and autotune_memcache_version but they really should be separate (just in case we need to change one but not the other)
- Prepare `_ManifoldCache` for use with other subpath keys
- Move create_cache to be more public and use it in codecache
- Add _InductorMetaTy alias (still just a dict)
- Cleaned up some common cached_autotune calls in triton_heuristics

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456
Approved by: https://github.com/oulgen
2024-09-24 14:00:23 +00:00
342c031f0e [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

TracingContext.flat_params contains the original params, with subclasses not yet desugared,
while the inductor freezing API works on AOT graphs, where subclasses are already desugared.

flat_params is used only for this logic, so storing the desugared subclasses in it fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-24 13:15:01 +00:00
cyy
f048569c24 [Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439)
Follows #131671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439
Approved by: https://github.com/kwen2501
2024-09-24 13:05:15 +00:00
538ee7bf60 Revert "Fix tensor.data_ptr() representation overflow (#135567)"
This reverts commit 2e8d431a8fbfdbdb07448195f16afa9e101188ac.

Reverted https://github.com/pytorch/pytorch/pull/135567 on behalf of https://github.com/etaf due to Block XPU, let's re-land with triton update. ([comment](https://github.com/pytorch/pytorch/pull/135567#issuecomment-2371200549))
2024-09-24 12:59:14 +00:00
32727b9859 Add types to _dynamo/testing.py (#136402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136402
Approved by: https://github.com/jansel
2024-09-24 10:23:54 +00:00
73c10a04f6 [dynamo][easy] support sys.intern (#136081)
Closes #134023

- #134023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136081
Approved by: https://github.com/anijain2305
2024-09-24 09:12:34 +00:00
1266be21f4 deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141)
Fix to #136140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141
Approved by: https://github.com/kwen2501
2024-09-24 07:26:10 +00:00
0a35986cdb Add option to configure reduced precision math backend for SDPA (#135964)
Summary: Address https://github.com/pytorch/pytorch/issues/135778 by adding a global flag to configure whether using high precision or low precision for math backend of SDPA.

Test Plan: buck2 run mode/opt //scripts/feikou/llm:run_attn_kernels

Differential Revision: D62625515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135964
Approved by: https://github.com/jbschlosser
2024-09-24 07:11:38 +00:00
44c871c34b [inductor] [cpp] add index check when fusing epilogue with GEMM template (#135661)
## Description
Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune.

The current epilogue fusion implementation in the GEMM template assumes that the read of the template buffer and the write of the epilogue output in the epilogue node have the same index (the layouts may differ, but the index should be the same).

If this condition is not satisfied, the computation is wrong, leading to a correctness issue for FP32 `jx_nest_base`.

This PR disables epilogue fusion with the GEMM template when the above condition is not satisfied.

### Unsupported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
401408 * d0 + 100352 * d1 + **7168 * d2** + **1792 * d3** + 128 * d4 + d5

The load of `buf1` in the epilogue node:
401408 * d0 + 100352 * d1 + **1792 * d2** + **25088 * d3** + 128 * d4 + d5

The above two indexes are different.

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      i0, i1, i2, i3, i4, i5 = index
      tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp2 = tmp0 + tmp1
      tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp4 = tmp2 + tmp3
      return tmp4
  ,
  ranges=[8, 4, 14, 4, 14, 128],
  origin_node=clone,
  origins=OrderedSet([clone])
))
```

### Supported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
d0 + 576 * d1 + 32 * d2

The load of `buf1` in the epilogue node:
d0 + 576 * d1 + 32 * d2

The above two indexes are the same.

The layout of `buf2` and `buf1` are different though which is handled by the reindexer:
`buf1`: `size=[324, 32], stride=[32, 1]`
`buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]`

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise(
  'cpu',
  torch.bfloat16,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(_frozen_param4, i1)
      tmp3 = tmp1 * tmp2
      tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2)
      tmp5 = tmp3 + tmp4
      tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32)
      return tmp6
  ,
  ranges=[1, 32, 18, 18],
  origin_node=convert_element_type_4,
  origins=OrderedSet([add, mul, convert_element_type_4])
))
```

## TODO
Add the support for fusions when the indexes are different in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-09-24 05:25:28 +00:00
7283530db2 [ROCm][Inductor][CK] FP8 gemm (#136337)
At the moment, lowering torch._scaled_mm with tensorwise scaling and rowwise scaling for both A and B

We probably also want to support either combination of tensorwise and rowwise for A and B, as well as bias support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136337
Approved by: https://github.com/chenyang78
2024-09-24 05:19:45 +00:00
7f98781f84 Fix autodeps from D62049222 that pyfmt broke (#136455)
Summary: `arc lint` changed the formatting which then caused autodeps to be confused.

Test Plan:
this passes:
```
arc lint --skip AUTODEPS
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/test/inductor/test_memory_planning.py
```

Differential Revision: D63277059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136455
Approved by: https://github.com/bobrenjc93, https://github.com/oulgen
2024-09-24 05:06:12 +00:00
797c7e2802 [Quant][PT2E]change flatten recipe for X86InductorQuantizer (#136298)
This PR modifies the flatten recipe: if none of the users of the flatten node are quantizable ops, int8 flatten will be disabled to avoid unnecessary dtype conversions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136298
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-09-24 04:30:12 +00:00
3be150653c [torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.

Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.

Test Plan: Added a test for this case in `test_numeric_debugger`.

Reviewed By: jerryzh168

Differential Revision: D62898297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
2024-09-24 03:28:12 +00:00
e09c5b6046 Remove vt argument in raise_observed_exception (#136037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136037
Approved by: https://github.com/zou3519
2024-09-24 02:36:57 +00:00
9372692c7b [FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473)
Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473
Approved by: https://github.com/c-p-i-o
2024-09-24 02:34:43 +00:00
4fd16dd8aa Clarify that libtorch API is C++17 compatible (#136471)
As it relies on some common C++17 primitives, such as `std::optional`
Replace all docs references from C++14 to C++17

Fixes https://github.com/pytorch/pytorch/issues/133205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-24 02:03:33 +00:00
e4d294221b [inductor] Log precompilation time (#136395)
This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395
Approved by: https://github.com/eellison
2024-09-24 01:47:54 +00:00
802ba79121 Inherit all secrets to inductor workflow (#135354)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135354
Approved by: https://github.com/desertfire, https://github.com/atalman, https://github.com/malfet
2024-09-24 01:30:40 +00:00
06909803cc Existing mypy issues (#136236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2024-09-24 01:02:07 +00:00
a14f57b126 fix the inductor tests (#136474)
Fixes https://github.com/pytorch/pytorch/issues/136464 introduced in https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136474
Approved by: https://github.com/malfet
2024-09-24 00:59:22 +00:00
9d9bc65b5e Make FlashAttentionKernel.cpp compilable for SVE with GCC-11 (#136477)
Extends https://github.com/pytorch/pytorch/pull/132434 to all minor revisions of GCC-11, as they are all likely affected by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

Hattip to @abhishek-iitmadras  for the investigation

Fixes https://github.com/pytorch/pytorch/issues/136432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136477
Approved by: https://github.com/atalman, https://github.com/kit1980
2024-09-24 00:54:26 +00:00
e0f84f40f7 [Pipelining] Allow non-0 stages to accept kwargs (#136416)
To support a usage case in torchchat:
all non-0 stages require `input_pos` and `cache_lane`.
```
kwargs = {"input_pos": input_pos, "cache_lane": lane}

if pp_rank == first_pp_rank:
    output = decorder.step(new_token, **kwargs)
elif pp_rank == last_pp_rank:
    output = decorder.step(**kwargs)
else:  # middle pp ranks
    decorder.step(**kwargs)
```

The `forward_one_chunk` code today hard-codes `{}` as the kwargs for non-0 stages, and hence cannot support the above use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136416
Approved by: https://github.com/wconstab
2024-09-23 23:50:59 +00:00
52c917b0ba Optimize dict reconstruct to not codegen untouched values (#134876)
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode (see the sketch after this list) to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
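
A Python-level sketch of the difference (the actual change operates on generated bytecode; names and helpers here are illustrative only):

```
# Old behavior: regenerate every item, then rebuild the dict wholesale.
def reconstruct_old(d, codegen):
    new_items = {k: codegen(v) for k, v in d.items()}
    d.clear()
    d.update(new_items)

# New behavior (sketch): clear only when a key was removed, and regenerate
# only the values that actually changed.
def reconstruct_new(d, codegen, changed_keys, removed_any):
    if removed_any:
        new_items = {k: codegen(v) for k, v in d.items()}
        d.clear()
        d.update(new_items)
    else:
        for k in changed_keys:
            d[k] = codegen(d[k])
```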

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-23 21:45:44 +00:00
5033a1ca0d [RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)
1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time we retry, we first create a new TCPStore server; this way we don't need to append the attempt count as a prefix and we avoid the eventual TCPStore sync failure. (This is only for the TCPStore sharing enabled case.)
2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port. We then pass that downstream (trainer or c10d). By doing so, TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer.
3. The port is then broadcast for dynamic_rendezvous.

One remaining question: what do we do about the store created from (_create_tcp_store) torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
2024-09-23 20:32:24 +00:00
fd182b90a7 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d45b0151e5d9a9358368b9fbd7fa454edd5d9709.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))
2024-09-23 19:57:13 +00:00
08dba25775 [BE] Do not use deprecated APIs in SparseCsrTensorMath.cu (#136449)
- `Tensor::type()` -> `Tensor::scalar_type()`
- `Tensor::data<T>()` -> `Tensor::data_ptr<T>()`

Should fix following warnings during the compilation:
```
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB_f32_notaligned_k128_dropout.cu.o
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In function ‘void at::native::_GLOBAL__N__496f0b0c_22_SparseCsrTensorMath_cu_868dd545::_apply_sparse_csr_linear_solve(const at::Tensor&, const at::Tensor&, bool, const at::Tensor&)’:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:739:36: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   739 |   int* rowOffsets = crow.data<int>();
       |                                    ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:740:35: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   740 |   int* colIndices = col.data<int>();
       |                                   ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:44: error: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                            ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:225:1: note: declared here
   225 |   DeprecatedTypeProperties & type() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1014: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1054: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1094: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136449
Approved by: https://github.com/huydhn
2024-09-23 19:20:34 +00:00
9a1dc41de7 [AMD] Skipping 0 byte send/recv for AMD GPU (#136362)
Summary: We found jobs getting stuck on zero-byte send/recv with RDMA on AMD GPUs, so we just skip them.

Reviewed By: danzimm

Differential Revision: D63075000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136362
Approved by: https://github.com/malfet, https://github.com/houseroad
2024-09-23 19:14:12 +00:00
3116fbda0f Increase update_hint_regression problem size to 1000 (#136434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434
Approved by: https://github.com/laithsakka
2024-09-23 18:51:44 +00:00
274883083d Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318)"
This reverts commit d21841d077b00350d5e621e7b74dace71849c701.

Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))
2024-09-23 17:47:49 +00:00
d859fcbc61 s390x: build s390x binaries on each pull request (#125399)
Ensure that s390x keeps building for each PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125399
Approved by: https://github.com/huydhn
2024-09-23 17:39:48 +00:00
83a3ee0699 Support embedding_bag() with NJT input (#135888)
Fixes #93843

`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the Python side to support NJT and adds corresponding OpInfo-based NJT tests.
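
For context, here is the classic 1D-input-plus-offsets form whose offsets NJT mirrors (this uses the long-standing dense API, not the new NJT path):

```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3)
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])   # flat indices for two ragged bags
offsets = torch.tensor([0, 4])                     # bag boundaries: [1,2,4,5] and [4,3,2,9]
out = F.embedding_bag(indices, weight, offsets, mode="mean")
print(out.shape)  # torch.Size([2, 3])
```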
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
2024-09-23 17:35:19 +00:00
4649aeaebf Make AOTAutogradCache support remote FXGraphCache (#136173)
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.

This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.

Test Plan: (Meta only tests)

Reviewed By: aorenste

Differential Revision: D62384944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
2024-09-23 17:24:27 +00:00
c3e678382b Fix addmm silent correctness on aarch64 (#136371)
Do not dispatch to fast gemmv functions when alpha is not equal to 1

Add regression test to address the problem

Fixes https://github.com/pytorch/pytorch/issues/136299
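
For illustration, a minimal check of the restored contract (shapes chosen to hit a matrix-vector shaped matmul; this is not the actual regression test):

```python
import torch

a = torch.randn(8, 16)
b = torch.randn(16, 1)            # matrix-vector shaped matmul (the gemmv fast path)
c = torch.randn(8, 1)
out = torch.addmm(c, a, b, alpha=2.0, beta=1.0)
ref = 1.0 * c + 2.0 * (a @ b)     # addmm semantics: beta*input + alpha*(mat1 @ mat2)
torch.testing.assert_close(out, ref)
```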

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371
Approved by: https://github.com/swolchok
2024-09-23 17:10:34 +00:00
f0f79dd8f1 Correctly convert Python float to float64 when passing argument as Tensor (#136413)
I can't actually test the Dynamo codegen fix as it is impossible to
directly use the Tensor at the moment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413
Approved by: https://github.com/bobrenjc93
2024-09-23 16:48:08 +00:00
637d5c4b7e [DSD] Fix loading uneven full tensor into sharded state dict (#136365)
Fix #136228.

This is a follow-up on https://github.com/pytorch/pytorch/pull/135725. We need to pass the shape and stride from the original DTensor, since in the uneven case `from_local` would calculate the shape and stride from the local tensor assuming it is evenly sharded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365
Approved by: https://github.com/fegin
2024-09-23 16:35:58 +00:00
da51fe1c42 [FR] Fix errors in all2all check, improve some log output (#136399)
We found that we show the hashed PG name in our script output, which is not UX friendly.
We also found a bug in our all2all check, and we made a number of changes to error messages to make them more readable.

Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399
Approved by: https://github.com/c-p-i-o
2024-09-23 16:31:31 +00:00
df6a8fa1eb Revert "[aotd] Fix freezing API for subclasses (#136265)"
This reverts commit cdef760560049ebda5fb7e30b1703f345fe05cfa.

Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))
2024-09-23 16:25:05 +00:00
9992084f38 [FSDP2] Fixed test_all_gather_extensions_monkey_patch (#136130)
I messed up the test before. The extensions were not running :/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136130
Approved by: https://github.com/weifengpy
ghstack dependencies: #136129
2024-09-23 15:12:44 +00:00
b9f53c0dce [FSDP2] Added module, mp policy to fsdp_pre_all_gather (#136129)
- Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR.
- Sometimes having access to the owning `nn.Module` allows using it to save state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example.

The major pain point here is how to deal with backward compatibility. For now, we use `inspect.signature` to check whether the user subclass follows the old or the new signature. With the new signature, the `param_dtype` in the post-all-gather is redundant, since if the user needed it, they could save it from the `mp_policy` now passed into the pre-all-gather.
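
A hypothetical sketch of the arity-based dispatch (not the actual FSDP2 code; hook names and parameters are made up for illustration):

```python
import inspect

def old_style_hook(all_gather_inputs):                      # old signature
    return all_gather_inputs

def new_style_hook(all_gather_inputs, module, mp_policy):   # new signature
    return all_gather_inputs

def call_pre_all_gather(hook, inputs, module=None, mp_policy=None):
    # Back-compat shim: inspect the user hook's parameter count
    # to decide whether to call it the old way or the new way.
    nparams = len(inspect.signature(hook).parameters)
    if nparams == 1:
        return hook(inputs)
    return hook(inputs, module, mp_policy)

print(call_pre_all_gather(old_style_hook, [1, 2]))
print(call_pre_all_gather(new_style_hook, [1, 2]))
```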

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129
Approved by: https://github.com/weifengpy
2024-09-23 15:12:36 +00:00
d21841d077 [AOTI] Create another wrapper class to handle ArrayRef (#136318)
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.

Test Plan: CI

Differential Revision: D62961885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318
Approved by: https://github.com/frank-wei
2024-09-23 15:10:27 +00:00
0e19522122 Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)"
This reverts commit 239a9ad65eebf93dcf9bb108a5129d4160b12c86.

Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))
2024-09-23 14:52:23 +00:00
bae427e4b1 Refactor maybe_evaluate_static into a worker function off of ShapeEnv (#135107)
By refactoring this way, I can put a non-expiring LRU cache here.
Splitting will also make it easier for me to tell who is using up all
the time.
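
A minimal sketch of why moving the logic to a free worker function helps: a module-level function (with no bound `self`) can carry a non-expiring `functools.lru_cache`. The simplification below is a stand-in, not the real symbolic evaluation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # non-expiring: entries are never evicted
def simplify(expr: str) -> str:
    # Stand-in for the real symbolic evaluation work.
    return expr.replace("+ 0", "").strip()

simplify("x + 0")
simplify("x + 0")                 # second call is served from the cache
print(simplify.cache_info())      # hits=1, misses=1
```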

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107
Approved by: https://github.com/aorenste
2024-09-23 14:39:20 +00:00
e9bfbf78d5 Revert "Allow fx graph caching higher order operators (opt-in) (#135877)"
This reverts commit 66d5eb64e0be91680a8531ccb24f098554610d46.

Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))
2024-09-23 09:04:24 +00:00
cyy
75f141be62 Avoid unnecessary CMake warnings on Windows (#136393)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393
Approved by: https://github.com/ezyang
2024-09-23 06:42:59 +00:00
663e760065 add unittest for OOM message (#129671)
Add unittest for the bug in #123984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671
Approved by: https://github.com/eqy
2024-09-23 04:48:01 +00:00
068fdd602f [export] enable custom tag metadata re-export test (#136048)
Improves and enables a commented out test originally introduced in #131912

In `test_custom_tag_metadata_re_export()`, we check that the "custom" metadata added to given nodes is preserved and not copied to other nodes after re-exporting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048
Approved by: https://github.com/zhxchen17
2024-09-23 04:37:58 +00:00
66d5eb64e0 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-23 04:33:27 +00:00
cyy
a38e4c5e1e Enable clang-tidy warnings on aten/src/ATen/cuda/*.cpp (#134547)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134547
Approved by: https://github.com/ezyang
2024-09-23 03:44:55 +00:00
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
3406ac24d9 [BE] fix circular import in torch/distributed/utils.py (#136286)
**Summary**
Fix a circular import in `torch/distributed/utils.py` found when running an internal test; see D62901023. Curious why this wasn't causing any issues before. Is the relevant code deprecated and no longer used?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286
Approved by: https://github.com/Skylion007
2024-09-22 20:54:12 +00:00
3bc073d728 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg; the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like the ones below, so we also implement a `zero_()` for `AtenTensorHandle`.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix the handling of `grid_fn` for grid computation. Pass "RBLOCK" in to `split_scan_grid`.
-  Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on whether `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-22 04:51:37 +00:00
35532fc477 [Partitioner] Reuse partition to check whether nodes exist (#135317)
Checking whether a node is in a NodeList is O(n). Reuse the partition instead to speed this up, since partition.nodes is a hash table that holds the same elements.
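
The underlying point, as a small sketch: membership tests on a list are linear, while a hash-based container answers them in constant time on average.

```python
nodes_list = list(range(100_000))
nodes_set = set(nodes_list)

print(99_999 in nodes_list)   # O(n): scans the list
print(99_999 in nodes_set)    # O(1) on average: hash lookup
```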

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-21 23:52:02 +00:00
cyy
e4cdc31227 [14/N] Fix clang-tidy warnings in aten/src/ATen (#133988)
Follows  #133807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133988
Approved by: https://github.com/ezyang
2024-09-21 22:41:40 +00:00
9731ccb9e0 Type _dynamo/variables/lazy.py (#136376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136376
Approved by: https://github.com/Skylion007
2024-09-21 22:18:02 +00:00
09715638ab Add _dynamo.config.suppress_errors logging (#136379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136379
Approved by: https://github.com/ezyang
2024-09-21 21:00:26 +00:00
3176966732 update cache tests (#136215)
Summary:
- Clean up cache test code a bit.
- Removed patch_fbcode(); it turned out to cause flaky issues (imagine if it set fbcode=False and then loaded, for the first time, a module that had a top-level fbcode check).

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215
Approved by: https://github.com/bobrenjc93
2024-09-21 20:36:22 +00:00
be4b7e8131 Param fixes in docstring (#136097)
Fixes wrong param names in docstrings. cc: @kit1980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136097
Approved by: https://github.com/ezyang
2024-09-21 18:56:34 +00:00
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
cc17d58809 Revert "S390x update builder image (#132983)"
This reverts commit 080a249fc2290602402e01bf5864d9d9a416e5b6.

Reverted https://github.com/pytorch/pytorch/pull/132983 on behalf of https://github.com/atalman due to Authenticate With PUSH is failing. Error: no registries found in registries.conf, a registry must be provided. Error: Process completed with exit code 125. ([comment](https://github.com/pytorch/pytorch/pull/132983#issuecomment-2365249249))
2024-09-21 16:46:54 +00:00
03957efa5d [inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874)
**Motivations**:
A topological order of the scheduler nodes that optimizes the liveness of buffers can reduce peak memory utilization. This has been observed and studied, e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).

**Solutions**:
1. implement a peak memory estimator via liveness analysis
2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory

**Results**:
On some models we can reduce the peak memory significantly:
|             model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet                       | 128        |         1.17         |       0.99      | 1.19  |
| vgg16                         | 64         |         4.10         |       3.57      | 1.15  |
| DebertaV2ForQuestionAnswering | 1          |         11.60        |      10.56      | 1.10  |

In the presence of compiler based AC, peak memory can be further reduced:
|              model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM              | 4          |         6.87         |       6.43      | 1.07  |
| AlbertForQuestionAnswering     | 4          |         8.69         |       7.76      | 1.12  |
| MobileBertForQuestionAnswering | 128        |         4.67         |       3.90      | 1.20  |

[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.

**Other infos:**
* neutral model runtime, because the reordering happens after fusion, so the memory saving is _for free_.
* minimal compile time overhead, as the algorithm is linear in the number of edges of the inductor graph. For all huggingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression, since we only adopt a new order if the estimated peak memory is reduced. However, the estimator is unaware of operators' working memory; for large models, the working memory should be negligible. We haven't observed any significant regressions in any of our tests.
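
A minimal sketch of the liveness-based peak-memory estimate from solution 1 above (illustrative only; the node and buffer structures are made up and this is not Inductor's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    reads: list = field(default_factory=list)    # buffers this node reads
    writes: list = field(default_factory=list)   # buffers this node allocates

def estimate_peak(schedule, buf_bytes):
    last_use = {}
    for i, node in enumerate(schedule):
        for buf in node.reads:
            last_use[buf] = i                    # last scheduled reader of each buffer
    live = peak = 0
    for i, node in enumerate(schedule):
        live += sum(buf_bytes[b] for b in node.writes)   # allocate outputs
        peak = max(peak, live)
        live -= sum(buf_bytes[b] for b in node.reads
                    if last_use[b] == i)                 # free buffers after their last read
    return peak

a = Node("a", writes=["buf0"])
b = Node("b", reads=["buf0"], writes=["buf1"])
c = Node("c", reads=["buf1"], writes=["buf2"])
print(estimate_peak([a, b, c], {"buf0": 4, "buf1": 4, "buf2": 4}))  # 8
```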

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
2024-09-21 16:28:38 +00:00
fb4670a1f9 fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU (#135174)
Fixes #134848

For BF16/FP16, when a tensor is specified via the `out` parameter of mean, the mean kernel should use that tensor's storage for the output. That doesn't happen today: an `at::to` in the current code allocates new storage, and the `out` tensor's storage never gets updated, so it ends up not holding the mean output.
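
A minimal sketch of the contract being fixed: the tensor passed as `out` must end up holding the result in its own storage.

```python
import torch

x = torch.randn(4, 4).to(torch.bfloat16)
out = torch.empty(4, dtype=torch.bfloat16)
torch.mean(x, dim=0, out=out)                     # must write into out's storage
torch.testing.assert_close(out, x.mean(dim=0))
```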

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174
Approved by: https://github.com/soulitzer
2024-09-21 14:21:42 +00:00
ea737e4e5d [Pipelining] Make PipelineStage support meta initialization (#136243)
Avoid allocating memory or dry-running the submodule during stage init.

Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.

Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.

For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.
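
For context, a small sketch of what "meta device" means here: tensors that carry only shape/dtype/stride metadata, with no real storage behind them (the materialization shown is a generic illustration, not the stage's lazy-init code):

```python
import torch

x = torch.empty(2, 1024, device="meta")   # metadata only, no memory allocated
print(x.shape, x.dtype, x.device)         # torch.Size([2, 1024]) torch.float32 meta

# Real buffers can be created later, once shapes are known, e.g.:
real = torch.empty_like(x, device="cpu")
print(real.shape, real.device)
```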

Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.

TODO:
- delete 'device' arg from PipelineStage ctor? (infer it from the args
  tensors passed to the first step call instead? separate PR)
- delete 'output_args' from PipelineStage ctor? we don't actually need
  it, but we use it to do shape validation, which is why I didn't remove
  it in this PR. Proposal: leave it until we add lazy shape inference?

Fixes #136225, #136226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
2024-09-21 09:47:22 +00:00
cyy
c459430558 Pass Werror to CUDA host compiler (#130213)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130213
Approved by: https://github.com/ezyang
2024-09-21 08:01:06 +00:00
e18439113e [PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349)
Summary: see D62220158

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pad_mm -- --exact 'caffe2/test/inductor:pad_mm - test_pad_mm_bf16 (caffe2.test.inductor.test_pad_mm.PadMMTest)' --run-disabled
```

### H100

Buck UI: https://www.internalfb.com/buck2/e5d85802-cab7-41a5-aacc-95f541796a99
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149258587374
Network: Up: 9.1KiB  Down: 0B  (reSessionID-b339b51b-6a0e-4347-9414-1ba38f26a5d0)
Jobs completed: 9. Time elapsed: 1:15.7s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 1. Build failure 0

### A100

Buck UI: https://www.internalfb.com/buck2/1082ad6e-56b0-4eb5-8092-ce507ca9a70e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8444249533824784
Network: Up: 9.2KiB  Down: 0B  (reSessionID-2b3056ac-f29e-4de4-b6f5-9d994acf566b)
Jobs completed: 9. Time elapsed: 1:36.9s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

see D62220158

Differential Revision: D63040455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136349
Approved by: https://github.com/dshi7
2024-09-21 06:35:50 +00:00
cyy
02871461f7 Fix clang-tidy warnings in torch/csrc/lazy (#134655)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134655
Approved by: https://github.com/ezyang
2024-09-21 02:59:35 +00:00
0b91e7e2dc Remove duplicate line (#136383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136383
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-21 01:35:13 +00:00
eqy
29f7b8d483 [TF32] Account for TF32 in test_conv_double_backward (#135716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135716
Approved by: https://github.com/Skylion007
2024-09-21 01:06:22 +00:00
7936584a88 Fix Vectorized<double>::next_after SVE compilation (#136388)
Should have called [`Sleef_nextafterdx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-double-precision-function-for-obtaining-the-next-representable-fp-value) rather than [`Sleef_nextafterfx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-single-precision-function-for-obtaining-the-next-representable-fp-value) to get vectorized `nextafter` for double precision rather than single precision values

This fixes a compilation issue introduced by https://github.com/pytorch/pytorch/pull/119571 and exposed by https://github.com/pytorch/pytorch/pull/133339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136388
Approved by: https://github.com/kit1980
2024-09-20 23:54:17 +00:00
067d203b22 Upgrade pybind11 API calls for 3.13t (#136370)
This is a modified version of https://github.com/pytorch/pytorch/pull/130341 that preserves support for older pybind versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136370
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-09-20 23:09:55 +00:00
1a10751731 [AOTI][Tooling] Filter out kernels based off lowercase names (#135395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135395
Approved by: https://github.com/YUNQIUGUO
2024-09-20 21:56:08 +00:00
0c936c3ecb Add decomps for max_unpool (#133146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-20 21:35:25 +00:00
293fccf86d add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012)
`torch::cuda::nccl` is an option for developers to depend only on torch and not on nccl. But to use `torch::cuda::nccl::send`/`torch::cuda::nccl::recv`, `ncclGroupStart()`/`ncclGroupEnd()` is needed, for which `torch::cuda::nccl::AutoNcclGroup` can be used. However, `torch::cuda::nccl::AutoNcclGroup` is not exported and is a LOCAL symbol, so it cannot be used from outside libtorch.

<img width="1618" alt="image" src="https://github.com/pytorch/pytorch/assets/1913192/25b0bd54-2da6-480f-876d-b05acfecfe62">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130012
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-09-20 21:20:25 +00:00
239a9ad65e Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-20 21:19:33 +00:00
cyy
d2455b99fb Use cpython declaration of _PyWeakref_ClearRef (#136300)
To avoid the DLL inconsistency warning from MSVC:
```
torch/csrc/utils/python_compat.h(38): warning C4273: '_PyWeakref_ClearRef': inconsistent dll linkage
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136300
Approved by: https://github.com/Skylion007
2024-09-20 18:58:58 +00:00
7f9c06462f fix mypy in utils/_sympy/functions.py (#136339)
Signed-off-by: Bob Ren <bobren@fb.com>

It turns out older versions of Python, in particular 3.8, show errors that 3.12 doesn't. For posterity, these are the steps I took to reproduce:

```
conda create -n py38 python=3.8
conda activate py38
pip install -r requirements.txt
lintrunner init
dmypy restart && lintrunner --all-files --take MYPY
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136339
Approved by: https://github.com/Skylion007
ghstack dependencies: #136205
2024-09-20 18:39:16 +00:00
f53a0f9cc1 [Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356)
Summary: The internal profiler behaves differently after turning on triton.autotune_at_compile_time. This needs more investigation, but turn it off for this test for now.

Reviewed By: henrylhtsang

Differential Revision: D63035855

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136356
Approved by: https://github.com/henrylhtsang
2024-09-20 18:35:09 +00:00
5997354151 Add more distributed examples (#130427)
1. Add `gather` example
2. Add device to `scatter` example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130427
Approved by: https://github.com/kwen2501
2024-09-20 18:27:27 +00:00
df1eef9779 Revert "[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)"
This reverts commit f3c54ccf8f6139807f4623037c0174964a286652.

Reverted https://github.com/pytorch/pytorch/pull/136282 on behalf of https://github.com/huydhn due to This breaks OSS, let revert it and land the revert internally then ([comment](https://github.com/pytorch/pytorch/pull/136282#issuecomment-2364219252))
2024-09-20 17:49:06 +00:00
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
29affa6b95 return instead of using skipTest (#136244)
Summary:
Return from functions instead of using `skipTest`.
This is mostly to make our test report happier.
Skipped tests still show up in our Broken test report.

```
OK (skipped=1)
I0917 16:14:24.749060 1018907 StorageDemandControl.cpp:572] Flushing Demand Control ODS counters

Skipped: Store doesn't support extended APIs
```

Test Plan:
Tested locally.
Test shows up as passed instead of skipped.

```
Cache hits: 99%. Commands: 125048 (cached: 124961, remote: 10, local: 77)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62912065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136244
Approved by: https://github.com/XilunWu
2024-09-20 17:36:28 +00:00
d7a6980078 [inductor] Make DtypeView work with cpp_wrapper without abi_compatible (#136233)
Fixes #136159

Prior to this PR, using cpp_wrapper without abi_compatible could result in incorrect dtypes.

The following block of code implements cpp_wrapper codegen for reinterpret_view for abi_compatible mode, but not for non-abi_compatible mode.

f6f1504d39/torch/_inductor/codegen/cpp_wrapper_cpu.py (L1678-L1814)

Added a test that verifies that we keep the view behavior while the returned tensors also have correct dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136233
Approved by: https://github.com/FindHao, https://github.com/eellison, https://github.com/jansel
2024-09-20 17:30:35 +00:00
080a249fc2 S390x update builder image (#132983)
S390x update builder image
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-20 17:26:26 +00:00
783c5ba80a Revert "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)"
This reverts commit 0b81f700aa7eb20d4b9f20e9627dd1208e50ea58.

Reverted https://github.com/pytorch/pytorch/pull/132765 on behalf of https://github.com/ezyang due to implementation is not correct, needs full rewrite ([comment](https://github.com/pytorch/pytorch/pull/132765#issuecomment-2364160452))
2024-09-20 17:10:27 +00:00
cdef760560 [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

TracingContext.flat_params contains the original params, with subclasses not yet desugared,
while the inductor freezing API works on AOT graphs, which have already desugared subclasses.

flat_params are used only for this logic, and storing desugared subclasses in them fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-20 16:32:49 +00:00
4842f0fac6 Enable torch build with SLEEF on ARM by default (#133339)
**Scope:** Enable the PyTorch build with SLEEF on Arm by default, and enable compilation of codegen kernels with SLEEF on the Arm platform.

Enabling the build with SLEEF by default and setting `AT_BUILD_ARM_VEC256_WITH_SLEEF` as the default for Arm improves performance for some models. I have benchmarked several networks on `Neoverse-V1` using `torch.compile` with the `inductor` backend.
On models like `hf_Bert_Large` and `hf_GPT_fast`, we're seeing a **~1.2x speedup** (with 16 threads).

The below results are run with `Batch_Size=1` and `Cores=8, 16`

![Screenshot 2024-08-27 at 17 04 23](https://github.com/user-attachments/assets/319c7ef7-1202-4145-a51a-7a80dfd5f1f6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133339
Approved by: https://github.com/malfet, https://github.com/kimishpatel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-20 16:02:32 +00:00
f3c54ccf8f [torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.

Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.

Test Plan: Added a test for this case in `test_numeric_debugger`.

Reviewed By: jerryzh168

Differential Revision: D62898297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
2024-09-20 07:34:52 +00:00
687e5cf8c5 [inductor] Relax the conditions for loop split (#135335)
Summary
This PR relaxes the conditions for loop split to support dynamic shape cases.
Now the conditions that need to be met to apply loop split optimization are as follows:

1. No reduction and no modular index for all nodes.
2. The indexing_exprs of all nodes contain only one (or more, but all the same) division, where the divisor is an integer, the dividend is one of the iter_vars, and this var, i.e. the dimension that needs to be split, is contiguous in all other indexing_exprs.

Example:
```
import torch
import torch.nn as nn

class GN(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m, dynamic=True)

with torch.no_grad():
    compiled_m(input)
```

Before loop split, the node's var_ranges: `{z0: s0, z1: s2, z2: s2, z3: 960}` and indexing_exprs: `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + z3, 'index1': 32*z0 + (z3//30), 'index2': 30*s2**2, 'index3': z3, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + z3}`. After loop split `z3` will split to `30*z3 + z4`, then the node's var_ranges will be changed to `{z0: s0, z1: s2, z2: s2, z3: 32, z4: 30}` and indexing_exprs will be changed to `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + 30*z3 + z4, 'index1': 32*z0 + z3, 'index2': 30*s2**2, 'index3': 30*z3 + z4, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + 30*z3 + z4}`

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(960L); x3+=static_cast<int64_t>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x3 + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1))))];
                            auto tmp1 = out_ptr0[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp3 = out_ptr1[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp11 = in_ptr1[static_cast<int64_t>(x3)];
                            auto tmp13 = in_ptr2[static_cast<int64_t>(x3)];
                            auto tmp2 = decltype(tmp0)(tmp0 - tmp1);
                            auto tmp4 = 30L*(static_cast<int64_t>(ks1*ks1));
                            auto tmp5 = c10::convert<float>(tmp4);
                            auto tmp6 = tmp3 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = decltype(tmp2)(tmp2 * tmp9);
                            auto tmp12 = decltype(tmp10)(tmp10 * tmp11);
                            auto tmp14 = decltype(tmp12)(tmp12 + tmp13);
                            out_ptr2[static_cast<int64_t>(x3 + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))] = tmp14;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x4=static_cast<int64_t>(0L); x4<static_cast<int64_t>(16L); x4+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))));
                            }
                            for(int64_t x4=static_cast<int64_t>(16L); x4<static_cast<int64_t>(30L); x4+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))), static_cast<int64_t>(14L));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135335
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-09-20 05:42:52 +00:00
cf31724db7 Fix and improvements toward 3.13t (#136319)
Small part of https://github.com/pytorch/pytorch/pull/130689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136319
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-09-20 04:22:18 +00:00
e3ea5429f2 Implement GetAttrVariable.as_python_constant() (#134216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134216
Approved by: https://github.com/amjames, https://github.com/williamwen42
2024-09-20 03:44:43 +00:00
d9aca9914b Remove duplicated words in library.rst (#136340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136340
Approved by: https://github.com/svekars
2024-09-20 03:30:54 +00:00
fe0e9fb385 Fix flaky SIGSEGV crash in test_profile_memory (#136304)
Fixes https://github.com/pytorch/pytorch/issues/132331

We need another barrier here to ensure that the main thread doesn't stop the profiler while other threads are still using it (and crash).  I can reliably reproduce the issue with `pytest -v test/profiler/test_cpp_thread.py -k test_profile_memory --flake-finder`.

### Testing

`pytest -v test/profiler/test_cpp_thread.py --flake-finder` all passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136304
Approved by: https://github.com/briancoutinho
2024-09-20 02:56:49 +00:00
d45b0151e5 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.
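
For context, a minimal usage sketch of the path this change affects (not from the PR; the shape and device are illustrative):
```python
# Hedged sketch: with deterministic algorithms enabled, CUDA cumsum should take
# the decomposition path and produce identical results across runs.
import torch

torch.use_deterministic_algorithms(True)
x = torch.randn(1_000_000, device="cuda")
out1 = torch.cumsum(x, dim=0)
out2 = torch.cumsum(x, dim=0)
assert torch.equal(out1, out2)
```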

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-20 02:41:56 +00:00
1dfa07e885 passing FileTimerRequests.to_json() to log_debug_info_for_expired_timers for a better debugging experience (#135913)
Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied. This change is made to provide a better debugging experience for the user.

Test Plan: unit tests

Reviewed By: gag1jain

Differential Revision: D62408767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913
Approved by: https://github.com/gag1jain
2024-09-20 00:54:02 +00:00
bebf5302ba TCPStoreLibUvBackend: trace operations (#136320)
Summary:
This logs all operations when the tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur, as it logs all hosts and the keys they're modifying. To minimize total data, we only log the keys and not the values.

This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if it would never be printed, which is very wasteful for detailed tracing. This now gates the formatting with an if statement to achieve the same behavior with no overhead.

Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```

```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```

Differential Revision: D62924454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
2024-09-20 00:53:21 +00:00
9b424aac1d [CI][CUSPARSELT] Extend cusparselt installation script to support cuda 12.6 (#136321)
To prepare for future cuda updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136321
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-19 23:45:57 +00:00
172ecf78b7 DTensor: dont hash symint tensor input in propagate_tensor_meta (#136266)
This fixes a subset of issues for dynamic shapes + DTensor.

It's pretty easy to run into other issues - it's likely that we need https://github.com/pytorch/pytorch/pull/125941 to land for DTensor + dynamic shapes to work more generally. I ended up writing a test that had dynamic shape inputs but not dynamic shape outputs in order to properly test this fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136266
Approved by: https://github.com/ezyang, https://github.com/yf225
2024-09-19 20:39:36 +00:00
cyy
7bbdf87517 [22/N] Fix clang-tidy warnings in jit (#134829)
Follows  #134537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134829
Approved by: https://github.com/ezyang
2024-09-19 19:24:42 +00:00
b71802fa79 add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136175
Approved by: https://github.com/ezyang
2024-09-19 19:15:50 +00:00
8cba0ec958 [AOTI][Tooling][8/n] Add option to pinpoint kernel names in debug printer (#136182)
Summary:
Add a third mode where we only print kernel names without dumping any intermediate actual tensor value info.

It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues.

thanks ColinPeppler and henrylhtsang for this "feature request".

Test Plan:
The output can look like this if `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3` is set:

{F1871629091}

Differential Revision: D62791371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136182
Approved by: https://github.com/henrylhtsang
2024-09-19 18:51:57 +00:00
49723a8ff3 fix stride compare failed when size value equal to one in ForeachUtils.h (#134546)
When a size value equals one, the corresponding tensor stride value needs to be skipped in the comparison.
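For illustration only (not the ForeachUtils.h code), a minimal sketch of the comparison rule being fixed:
```python
# For dimensions of size 1 the stride is irrelevant, so it should be skipped
# when deciding whether two tensors share the same memory layout.
def same_strides(sizes, strides_a, strides_b):
    return all(size == 1 or sa == sb
               for size, sa, sb in zip(sizes, strides_a, strides_b))

# e.g. shape (1, 4): strides (4, 1) and (1, 1) describe the same layout
assert same_strides((1, 4), (4, 1), (1, 1))
```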
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134546
Approved by: https://github.com/janeyx99
2024-09-19 18:43:41 +00:00
ccca3de0cd [ROCm] Enable Flex attention tests on AMD gpus (#136245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136245
Approved by: https://github.com/malfet
2024-09-19 18:02:41 +00:00
8d9c42735a Type _sympy/functions.py [1/n] (#136205)
Signed-off-by: Bob Ren <bobren@fb.com>

I was chatting with @jamesjwu about strategies to learn the code and he suggested adding types to some files. This stack of PRs adds types to _sympy/functions.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136205
Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
2024-09-19 17:15:53 +00:00
803ce507f1 Log structured logging overhead to dynamo compile (kinda) (#136142)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2454

This adds structured logging overhead at a per compile basis to compilation metrics.

To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.

Implementation notes:
- If there are times we call trace_structured without a compile id, the time won't be measured. There's not really a good way around that today given the compile-id framework of compilation metrics. Strobelight is still the best way to measure on a per-job basis.
- We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it (unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though.

Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq

Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 ≈ 6.7%, which seems reasonable as the overhead for a small compilation like this.

You can also look at samples for a more detailed log of this.

Reviewed By: oulgen

Differential Revision: D62643611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142
Approved by: https://github.com/bobrenjc93
2024-09-19 16:11:38 +00:00
65df26f615 [FSDP2] Fixed 2D mismatched grad placements (#136237)
```
CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer
```

Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136237
Approved by: https://github.com/weifengpy
2024-09-19 14:35:15 +00:00
4ea741d24f Revert "Reland D62220158 (#136213)"
This reverts commit 083c9149b75cd918b6fb2795050d7173923a3629.

Reverted https://github.com/pytorch/pytorch/pull/136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](https://github.com/pytorch/pytorch/pull/136213#issuecomment-2360885064))
2024-09-19 12:44:54 +00:00
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate the PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
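
A minimal before/after sketch of the annotation change (the function is hypothetical and only illustrates the pattern):
```python
import numpy as np
import numpy.typing as npt

# Before: def normalize(x: np.ndarray) -> np.ndarray
# After: a proper generic annotation that type checkers accept under numpy-1.24+
def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return x / np.linalg.norm(x)
```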

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
908a5689eb Return unsafe_view instead of view from matmul when folding occurs (#134568)
When tensor folding occurs during a matmul operation, the returned tensor is a view. This can cause issues when matmul is used inside a custom function and such a view is then returned as output; it cannot be modified in place afterwards, which causes errors.
It can be especially problematic when an in-place allreduce is performed after such a function.
The issue is resolved by returning unsafe_view from matmul instead. This aligns the matmul decomposition with the eager implementation in such a way that a non-view tensor is returned.

Test included in this PR reproduces the issue.
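
For illustration only (this is not the PR's test; shapes are illustrative), a hedged sketch of the pattern described above:
```python
# When the decomposition returned a view from the folding path, an in-place
# update on the output (e.g. before an in-place allreduce) could fail; returning
# unsafe_view makes the output an ordinary, mutable tensor.
import torch

def custom_fn(a, b):
    return torch.matmul(a, b)  # batched input folds into a single mm

a = torch.randn(2, 3, 4)
b = torch.randn(4, 5)
out = custom_fn(a, b)
out.add_(1.0)  # in-place mutation of the returned tensor
```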

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568
Approved by: https://github.com/zou3519
2024-09-19 11:52:16 +00:00
db80b98ec4 XFAIL test_segfault (#136252)
Fixes https://github.com/pytorch/pytorch/issues/128551

As this has been failing in trunk for a while and there is no owner yet to fix it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252
Approved by: https://github.com/andrewkho
2024-09-19 04:17:06 +00:00
775517693a Add type checks for Tensor.add_ (#135864)
Fixes  #127049

There's already a meta func in `meta_registrations.py` for the `add_` and `sub_` methods. I added a second meta function for error checking, i.e. `int.add/sub_(float)` and `bool.add/sub_(other types)`.

Also the corresponding test with Dynamo passes, removed `@xfailIfTorchDynamo`.
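
A hedged sketch of the rejected patterns (the exact error wording may differ):
```python
import torch

x = torch.ones(3, dtype=torch.int64)
try:
    x.add_(1.5)  # int tensor updated in place with a float: rejected
except RuntimeError as e:
    print(e)  # result type can't be cast to the output dtype

b = torch.zeros(3, dtype=torch.bool)
try:
    b.add_(1)  # bool tensor with a non-bool operand: rejected
except RuntimeError as e:
    print(e)
```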

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864
Approved by: https://github.com/williamwen42
2024-09-19 03:09:36 +00:00
e037bb326f [dynamo] fix crash in InspectSignatureVariable (#136010)
Fix crash that was happening in https://github.com/pytorch/pytorch/issues/128095, because we were trying to extract a constant incorrectly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136010
Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel
2024-09-19 00:23:00 +00:00
f2b0fc89f2 Add uint16 support for observer (#136238)
Summary:
att

Test Plan:
python test/test_quantization.py -k TestObserver

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136238
Approved by: https://github.com/tarun292
2024-09-18 23:52:18 +00:00
068c80e6b6 [BE][MPS] Fix deprecation warnings on MacOS 15.0 (#136292)
[reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) were deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc)

Without it, following warnings are generated if compiled on recently released MacOS Sequoia:
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  720 |           rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                   reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph *, CachedGraph *>' requested here
  341 | decltype(std::declval<_Fp>()(std::declval<_Args>()...))
      |          ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph *, CachedGraph *>]
  351 |   static decltype(std::__invoke(std::declval<_XFp>(), std::declval<_XArgs>()...)) __try_call(int);
      |                   ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)]
  357 |   using _Result = decltype(__try_call<_Fp, _Args...>(0));
      |                            ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph *, CachedGraph *>' requested here
   27 | __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int);
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper'
   38 | using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0));
      |                                       ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all)
  828 |             bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value>
      |                    ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here
  841 |   using _EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>;
      |                                                 ^~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here
  851 |   template <class _Fp, class = _EnableIfLValueCallable<_Fp>>
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here
  852 |   _LIBCPP_HIDE_FROM_ABI function(_Fp);
      |                         ^~~~~~~~~~~~~
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                                                                    ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  745 |             rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                     reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136292
Approved by: https://github.com/kit1980
2024-09-18 23:38:31 +00:00
b9a197df77 [BE][MPS] Delete duplicated code in View.mm (#136295)
After https://github.com/pytorch/pytorch/pull/135706, `getGatherScatterScalarType` returns exactly the same results as `scalarToMetalTypeString`, so delete the former and call `scalarToMetalTypeString` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136295
Approved by: https://github.com/kit1980
2024-09-18 22:44:43 +00:00
f1ad680818 [dynamo]Remove stream hardcoding in dynamo VariableBuilder (#131763)
Fixes #ISSUE_NUMBER

A recent change from PR #123487 used torch.cuda.Stream directly, which causes failures for other backends. This PR generalizes the stream handling for all backends such as cuda/hpu/xpu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131763
Approved by: https://github.com/yanboliang, https://github.com/yf225
2024-09-18 22:32:34 +00:00
bc9597b7d8 [Traceable FSDP2] Minor refactor to traceable FSDP2 unit tests (#136219)
Changes in this PR:
- Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda.
- Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and produce errors. This PR splits them into two separate unit tests.
- The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219
Approved by: https://github.com/yifuwang
2024-09-18 22:30:23 +00:00
1a86d8aa29 Fix calling Add._from_args and Mul._from_args (#136143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136143
Approved by: https://github.com/ezyang
2024-09-18 20:51:04 +00:00
aae68e2976 Add wait counter for nccl abort (#136067)
Summary:
Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack.

This will help us measure how much time the NCCL abort takes.
Test Plan:
Unit tests

Reviewed By: c-p-i-o

Differential Revision: D62675010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067
Approved by: https://github.com/fduwjj
2024-09-18 20:14:10 +00:00
eqy
68a7246f13 [cuDNN][conv][A100] Bump tolerances for vmap_autograd_grad conv2d on A100 (#136178)
Likely due to a cuDNN heuristics update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136178
Approved by: https://github.com/Skylion007
2024-09-18 19:42:13 +00:00
5a6ddbcc3b Extending the Pytorch vec backend for SVE (ARM) (#119571)
**Motivation:**
In PyTorch, ATen vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of the Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) within 256-bit and 512-bit registers. It can be extended to support other ISAs easily by adding more VecISA sub-classes.

**Reference Link:** https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec

**This PR:**

* Our goal with this contribution is to add an SVE backend for Vec in the ATen CPU vectorization, which can benefit any Arm-architecture CPU that supports SVE.

* More about SVE ISA for ARM: [https://developer.arm.com/Architectures/Scalable Vector Extensions](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions)

* We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics ) to accelerate performance for various operators in the SVE backend for Vec.

* Currently we are adding support only for SVE ISA with the vector length of 256 bits (SVE 256). In future, we plan to extend this SVE support for other vector lengths as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119571
Approved by: https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Divya Kotadiya <divya.kotadiya@fujitsu.com>
2024-09-18 18:59:10 +00:00
bad69044d8 [ROCm] upgrade ROCm CI builds to py3.10 (#134108)
Upgrade ROCm CI builds to py3.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134108
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-18 17:39:34 +00:00
3efaa016b1 [c10d] Make test compatible for new pytest (#136158)
Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517.

Short-term fix following CPython: 51aefc5bf9/Lib/unittest/case.py (L419-L426)

Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158
Approved by: https://github.com/fegin
2024-09-18 17:10:55 +00:00
605f2d802a [PyTorch] Remove unnecessary include of c10/util/Exception.h in irange.h (#136202)
Manually audited and can't figure out why this would be needed.

Differential Revision: [D62879500](https://our.internmc.facebook.com/intern/diff/D62879500/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136202
Approved by: https://github.com/malfet
2024-09-18 16:57:15 +00:00
6a6f5b20c5 Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936)
Fixes #132613.
Add `_addmm_activation` to lower precision cast policy on AutocastCPU.
`_addmm_activation` (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39), used in `transformer_encoder_layer_forward`, may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, because `_native_multi_head_attention` was put in the lower-precision cast policy in https://github.com/pytorch/pytorch/pull/107674 and `_addmm_activation` may therefore encounter mixed data types.
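
A hedged repro-style sketch of the scenario (module sizes are illustrative; whether the transformer fast path is actually taken depends on configuration):
```python
import torch

layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True).eval()
x = torch.randn(2, 8, 64)
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = layer(x)  # previously could raise the mixed-dtype RuntimeError in _addmm_activation
```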

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-09-18 16:31:27 +00:00
c8d152cb0e Fix fast_expand recursion error (#136163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136163
Approved by: https://github.com/ezyang
2024-09-18 13:58:45 +00:00
701ba5203f [Inductor] Increase multiplier to 3 for Inductor AMP FP16 benchmark correctness check (#135932)
Fix https://github.com/pytorch/pytorch/issues/135657.
To align with AMP BF16, use a multiplier of 3 for the Inductor AMP FP16 benchmark correctness check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135932
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
2024-09-18 13:03:45 +00:00
b5be4d8c05 Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
skip_if_rocm is used only in the multiprocess case (when the UT test class is a child of MultiProcessTestCase), where each individual process can exit with a skip code. If used for a single-process UT, it will cause the UT to fail because the process returns a non-zero exit code. Use skipIfRocm in single-process UTs.

To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
2024-09-18 11:01:23 +00:00
083c9149b7 Reland D62220158 (#136213)
Summary: We fix the unit test test_pad_mm and reland the diff

Test Plan: See in D62220158

Differential Revision: D62891584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136213
Approved by: https://github.com/dshi7
2024-09-18 07:33:41 +00:00
a0207c8471 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-18 04:47:51 +00:00
9aa22eabe7 [CI] Make linux-aarch64 shards actually running different tests (#136208)
Non-functional sharding was introduced in https://github.com/pytorch/pytorch/pull/125255 but each shard in that case was running the same tests...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136208
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/atalman
2024-09-18 03:10:21 +00:00
8895f69d12 [torch/numpy][numpy2.0 compat] Additional changes for tests to run under numpy-2.0 (#136152)
Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0.

Changes in this PR:
1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x.
2. Do the same for `numpy.exceptions.VisibleDeprecationWarning`
3. Use `np.sort(...,axis=0)` over `np.msort()`(`np.msort()` removed in numpy-2.0)
4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0)
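
A minimal sketch of the compatibility pattern behind items 1 and 2 above (the fallback structure is illustrative):
```python
import numpy as np

# Prefer the new namespace when it exists (numpy >= 1.25), fall back otherwise.
if hasattr(np, "exceptions"):
    ComplexWarning = np.exceptions.ComplexWarning
    VisibleDeprecationWarning = np.exceptions.VisibleDeprecationWarning
else:
    ComplexWarning = np.ComplexWarning
    VisibleDeprecationWarning = np.VisibleDeprecationWarning
```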

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152
Approved by: https://github.com/atalman
2024-09-18 02:11:22 +00:00
6682327c75 [BE] Make NestedTensorTransformerFunctions.cu compilable without warnings (#136222)
Before the change compilation produced following warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In function ‘std::tuple<dim3, dim3, at::native::StackArray<long int> > at::native::check_shape_and_partition_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:584:22: warning: comparison of integer expressions of different signedness: ‘const int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
  584 |   TORCH_CHECK(num_jagged_dim <= kStackArrayMaxDims);
      |       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1061: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1985: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In instantiation of ‘void at::native::jagged_dense_elementwise_jagged_output_opt_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&, const at::Tensor&, F) [with scalar_t = c10::Half; F = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(const at::Tensor&, c10::ArrayRef<at::Tensor>, std::optional<c10::SymInt>), at::native::_fbgemm_dense_to_jagged_forward_symint, c10::Half, 1> >]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1515:1:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2006: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2113: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
```
after it compiled without a warning

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136222
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-09-18 01:24:05 +00:00
b18ba9419e [AO][Inductor] Enable WOQ fusion pattern with permute (#135928)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/135831 and https://github.com/pytorch/ao/issues/890. The root cause of the numerical failure was that the customized woq-int8 kernel was not triggered due to changes in the pattern. After re-adding the fusion pattern, the accuracy check now passes. I will open a separate TorchAO PR to enable these unit tests in TorchAO.

**Test Plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135928
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-09-18 00:56:16 +00:00
cccf500193 [c10d] remove sleep from watchdogHandler (#135760)
Summary:
Remove the sleep from the `watchdogHandler` function. This sleep unnecessarily slows things down during an NCCL timeout.
Flight recorder is configured to take a minute, at most, to dump out its buffer.
This sleep ends up waiting for `8` minutes before destroy is called.

Test Plan: Unit tests.

Differential Revision: D62529875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
2024-09-18 00:55:01 +00:00
f6f1504d39 [MPS] Fix 5D+ reductions over negative dimensions (#136198)
This fixes a bug introduced by https://github.com/pytorch/pytorch/pull/99856 that attempts to speed up reductions for 5D+ tensors if the trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped dimensions.

Added regression test case to `TestMPS.test_sum`

Fixes https://github.com/pytorch/pytorch/issues/136132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198
Approved by: https://github.com/albanD
2024-09-17 21:53:31 +00:00
a575ce0dc6 [PyTorch Pinned Allocator] Add support of background thread to process events (#135524)
Summary: Currently we process events in the regular allocation path, calling cudaEventQuery to check on the events, and this path can take some locks in the libcuda driver. It's not strictly necessary to process events in the allocation path; we could move this to a background thread that keeps processing events regularly and puts the freed blocks onto the free list.

Differential Revision: D62396585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
2024-09-17 21:08:10 +00:00
48d18fbd4c [PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174)
Summary:
This diff adds an option to round the non-split blocks in the caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.

For example, if we specify the max_split memory size as 400MB, then all allocations larger than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-of-2 division, that's 512MB, and add the default kLargeBuffer of 20MB, giving 532MB; since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation and a new 512MB block will be created instead. In this diff, we provide an option to configure the kLargeBuffer used for rounding and expose it as a configurable option (max_non_split_rounding_size): if 512MB + max_non_split_rounding_size is greater than 1024MB, we will use the 1024MB block and won't create a new 512MB block with cudaMalloc. This option is added so that we can pre-allocate some large blocks, reuse them as much as possible, and avoid stalling on cudaMalloc.
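
A hedged arithmetic sketch of the reuse decision described above (pure illustration; the function names and constants are not the allocator's actual API):
```python
def next_pow2(mb: int) -> int:
    p = 1
    while p < mb:
        p *= 2
    return p

def can_reuse(cached_block_mb: int, request_mb: int, max_non_split_rounding_mb: int) -> bool:
    rounded = next_pow2(request_mb)  # 500 MB -> 512 MB
    return cached_block_mb <= rounded + max_non_split_rounding_mb

print(can_reuse(1024, 500, 20))   # False: 512 + 20 = 532 < 1024, so a new block is allocated
print(can_reuse(1024, 500, 512))  # True: 512 + 512 >= 1024, so the cached 1024 MB block is reused
```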

Differential Revision: D62758758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
2024-09-17 19:08:44 +00:00
eqy
e3aa5e2f64 [NCCL] Don't override waitUntilInitialized's setting of comm->initialized_ (#136155)
#133630 sets `initialized_` to `true`, which causes previous wait codepaths to skip necessary waits; see also https://github.com/pytorch/pytorch/issues/136151

CC @shuqiangzhang @wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155
Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang
2024-09-17 18:50:12 +00:00
a4e9a1c90b [TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between EBC and KTRegroupAsDict modules (#136045)
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* the basic idea of this diff is to **short-circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs is very special and a must in most cases due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.

# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBC/fpEBC and KTRegroupAsDict modules in an unflattened module, retrieves their fqns, and sorts them into in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because the fpEBC is currently swapped as a whole, we do some extra fqn logic to filter out the EBCs that belong to an up-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.

WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added for the cases where a `KTRegroupAsDict` module can't be found or `finalize_interpreter_modules` is not `True`.

# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` for closely mimicking the APF model structure.

Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
  test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler.            --_ |""---__
Executed 1 example in 203.1s:                               |'.|  ||  .    """|
  Successful: 1                                             | ||  || /|\""-.  |
  Failed: 0                                                 | ||  ||  |    |  |
  Skipped: 0                                                | ||  ||  |   \|/ |
  Not executed: 8                                           |."|  ||  --"" '__|
https://testslide.readthedocs.io/                              --" |__---"""
```

Differential Revision: D62606738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
2024-09-17 18:42:56 +00:00
ea10c072f3 [export] Deserialize args with python keyword names (#136036)
Currently when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in Python because `from` is a Python keyword. So the solution here is to not deserialize it as a kwarg.
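
For illustration, a hedged sketch of why the kwarg form can never work in Python:
```python
import torch

x = torch.empty(3)
x.uniform_(0, 1)            # fine: positional args
# x.uniform_(from=0, to=1)  # SyntaxError: "from" is a reserved Python keyword
```
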
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036
Approved by: https://github.com/zhxchen17
2024-09-17 18:13:14 +00:00
a8382847f4 Support rms_norm() for NJT (#135872)
`rms_norm()` is a nice-to-have for ViT :)

This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
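
A hedged usage sketch of the new support (shapes are illustrative; assumes the jagged NJT layout):
```python
import torch
import torch.nn.functional as F

# Nested (jagged) tensor with ragged sequence lengths and a fixed feature dim of 8.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
out = F.rms_norm(nt, normalized_shape=(8,))  # normalization over the non-ragged last dim
```
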
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
2024-09-17 18:09:20 +00:00
785e98783b Delete links to non-existing run_plan_mpi.cc (#136204)
That were deleted by https://github.com/pytorch/pytorch/pull/125092

Fixes https://github.com/pytorch/pytorch/issues/136199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-09-17 17:51:56 +00:00
cc365fdd7b [MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889)
Summary:
Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn

At the moment, both the major and minor version are just 0

Test Plan:
Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/

Differential Revision: D62595296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889
Approved by: https://github.com/egienvalue
2024-09-17 17:42:56 +00:00
8e5bb356e0 [PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527)
Summary: as title

Test Plan: new UT

Differential Revision: D62398390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135527
Approved by: https://github.com/frank-wei
2024-09-17 17:26:53 +00:00
63dc5dff10 [Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857)
Regression PR : https://github.com/pytorch/cpuinfo/pull/255

Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-09-17 16:50:17 +00:00
67b14ce8bd [ONNX] Fix numpy method to return the correct type (#136162)
The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected and appears to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to define the `numpy()` method explicitly and added tests to guard the behavior.
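A quick sketch of the dtype-preservation property the new tests guard (shown with a plain `torch.Tensor` for illustration; the actual fix lives in the ONNX tensor wrapper's `numpy()` method):

```python
import torch

t = torch.ones(2, dtype=torch.float32)
arr = t.numpy()
assert arr.dtype.name == "float32"  # should stay fp32, not silently widen to fp64
```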

This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-17 15:51:00 +00:00
ece8267d2c Add back optim type hints that were lost when *.pyi files were removed (#136185)
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
2024-09-17 15:45:15 +00:00
913f97e878 Don't run reshape pattern match on dynamic shape size tensor (#136100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136100
Approved by: https://github.com/mengluy0125
2024-09-17 15:08:55 +00:00
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
3b5e2689a1 Revert "Optimize dict reconstruct to not codegen untouched values (#134876)"
This reverts commit a1a57a424dc992f4dc2d44bdc1e4e7e500881a9c.

Reverted https://github.com/pytorch/pytorch/pull/134876 on behalf of https://github.com/jeanschmidt due to new introduced test test_reconstruct.py::ReconstructTest::test_functional_call_reconstruct is breaking internally. @zou3519 may you help get those changes merged back to main? ([comment](https://github.com/pytorch/pytorch/pull/134876#issuecomment-2355697685))
2024-09-17 13:00:01 +00:00
e248c1d7eb Update real device in FSDP state_dict_utils (#134994)
## Motivation
The device returned by `tensor.device`, for both sharded and non-sharded tensors, is set to cuda by default. Hence, while running the FSDP UTs, we see the following errors. This change updates the device type based on the tensor that was actually created.

```
[rank3]   File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3]     sharded_tensor_sd = ref_model.state_dict()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3]     hook_result = hook(self, destination, prefix, local_metadata)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]     return func(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3]     tensor.device,
[rank3]   File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3]     return arg(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3]     return dispatch(st_instance, func)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3]     return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3]     return wrapped_func(types, args, kwargs, process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3]     dev = torch.device(torch.cuda.current_device())
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3]     _lazy_init()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3]     raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
```
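A hedged sketch of the idea behind the fix (illustrative only, not the exact `_state_dict_utils` code): read the device type off the tensor that was produced instead of lazily querying CUDA.

```python
import torch

def _device_of(tensor: torch.Tensor) -> torch.device:
    # Works for cpu/cuda/hpu/... and never calls torch.cuda.current_device(),
    # so it cannot trip "Torch not compiled with CUDA enabled".
    return tensor.device

t = torch.randn(4)       # e.g. on a CPU-only or HPU build
print(_device_of(t))     # device(type='cpu')
```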

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
2024-09-17 04:39:08 +00:00
408fe41a45 [DSD][EZ] Minor update in _state_dict_utils.py (#136165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165
Approved by: https://github.com/kwen2501
ghstack dependencies: #135725, #135763
2024-09-17 04:32:43 +00:00
dc82d274e6 make view.dtype always return an alias (#136074)
Fixes https://github.com/pytorch/pytorch/issues/136064

In the linked repro, the issue was that there was some code like this:
```
# x has dtype torch.float32
def f(x):
    y = x.view(torch.float32)
    y.copy_(...)
```

Where because `view.dtype` is implemented today to potentially directly return its input, we would end up directly clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input.

Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`).

This does **not** happen, though, if you are executing the kernel from within a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set.

This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a __torch_dispatch__ call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input.

I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible.
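An illustration of the resulting eager behavior (a sketch of the property this PR enforces):

```python
import torch

x = torch.randn(4, dtype=torch.float32)
y = x.view(torch.float32)              # same dtype as x
assert y is not x                      # a fresh TensorImpl is returned...
assert y.data_ptr() == x.data_ptr()    # ...that still aliases x's storage
```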

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #136041
2024-09-17 03:40:54 +00:00
d463a81c27 inductor: dont use default_dtype during rng functionalization (#136041)
Fixes https://github.com/pytorch/pytorch/issues/119162

See context at https://github.com/pytorch/pytorch/issues/119162#issuecomment-2349849469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136041
Approved by: https://github.com/eellison
2024-09-17 03:40:54 +00:00
3f74310784 Back out "Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)" (#136160)
Test Plan: make train-hstu-cint-publish-bf16-tgif-local

Differential Revision: D62766335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136160
Approved by: https://github.com/muchulee8
2024-09-17 01:06:10 +00:00
37a08b33bb Revert "fix compiled_autograd deadlock throw (#135795)"
This reverts commit 00dc7d435652ad66e9d2feb2660928b632281a98.

Reverted https://github.com/pytorch/pytorch/pull/135795 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135795#issuecomment-2354233619))
2024-09-16 23:59:56 +00:00
071da87cd7 use csv extention for test report in order for it to be uploaded to s3 (#136128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136128
Approved by: https://github.com/clee2000
2024-09-16 21:47:46 +00:00
c12536b3c0 [ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#136153)
Since https://github.com/pytorch/pytorch/pull/135080, the CompositeImplicitAutograd (CIA) ops are only decomposed when a decomp function is provided in a table. There is no longer a need to distinguish CIA ops like Upsample and preserve them explicitly. On the ONNX Script torchlib side I will unregister some ops from the following list to make sure some CIA ops are still decomposed.

```
<OpOverload(op='aten.__and__', overload='Scalar')>,
 <OpOverload(op='aten.__and__', overload='Tensor')>,
 <OpOverload(op='aten.__or__', overload='Scalar')>,
 <OpOverload(op='aten.__or__', overload='Tensor')>,
 <OpOverload(op='aten.__xor__', overload='Scalar')>,
 <OpOverload(op='aten.__xor__', overload='Tensor')>,
 <OpOverload(op='aten._add_batch_dim', overload='default')>,
 <OpOverload(op='aten._assert_tensor_metadata', overload='default')>,
 <OpOverload(op='aten._backward', overload='default')>,
 <OpOverload(op='aten._batch_norm_impl_index_backward', overload='default')>,
 <OpOverload(op='aten._cast_Byte', overload='default')>,
 <OpOverload(op='aten._cast_Char', overload='default')>,
 <OpOverload(op='aten._cast_Double', overload='default')>,
 <OpOverload(op='aten._cast_Float', overload='default')>,
 <OpOverload(op='aten._cast_Half', overload='default')>,
 <OpOverload(op='aten._cast_Int', overload='default')>,
 <OpOverload(op='aten._cast_Long', overload='default')>,
 <OpOverload(op='aten._cast_Short', overload='default')>,
 <OpOverload(op='aten._choose_qparams_per_tensor', overload='default')>,
 <OpOverload(op='aten._convolution', overload='deprecated')>,
 <OpOverload(op='aten._convolution_double_backward', overload='default')>,
 <OpOverload(op='aten._convolution_mode', overload='default')>,
 <OpOverload(op='aten._cufft_clear_plan_cache', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_size', overload='default')>,
 <OpOverload(op='aten._cufft_set_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._debug_has_internal_overlap', overload='default')>,
 <OpOverload(op='aten._dim_arange', overload='default')>,
 <OpOverload(op='aten._embedding_bag_sparse_backward', overload='default')>,
 <OpOverload(op='aten._gather_sparse_backward', overload='default')>,
 <OpOverload(op='aten._grid_sampler_2d_cpu_fallback_backward', overload='default')>,
 <OpOverload(op='aten._has_compatible_shallow_copy_type', overload='default')>,
 <OpOverload(op='aten._is_zerotensor', overload='default')>,
 <OpOverload(op='aten._lu_with_info', overload='default')>,
 <OpOverload(op='aten._nnpack_available', overload='default')>,
 <OpOverload(op='aten._pack_padded_sequence_backward', overload='default')>,
 <OpOverload(op='aten._pad_circular', overload='default')>,
 <OpOverload(op='aten._pad_enum', overload='default')>,
 <OpOverload(op='aten._pad_packed_sequence', overload='default')>,
 <OpOverload(op='aten._propagate_xla_data', overload='default')>,
 <OpOverload(op='aten._remove_batch_dim', overload='default')>,
 <OpOverload(op='aten._reshape_from_tensor', overload='default')>,
 <OpOverload(op='aten._rowwise_prune', overload='default')>,
 <OpOverload(op='aten._saturate_weight_to_fp16', overload='default')>,
 <OpOverload(op='aten._scaled_dot_product_attention_math', overload='default')>,
 <OpOverload(op='aten._shape_as_tensor', overload='default')>,
 <OpOverload(op='aten._sobol_engine_draw', overload='default')>,
 <OpOverload(op='aten._sparse_bsc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_bsr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_compressed_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_coo_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_mm', overload='default')>,
 <OpOverload(op='aten._sparse_mm', overload='reduce')>,
 <OpOverload(op='aten._sparse_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_sum', overload='default')>,
 <OpOverload(op='aten._sparse_sum', overload='dim_dtype')>,
 <OpOverload(op='aten._sparse_sum', overload='dtype')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='a')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='b')>,
 <OpOverload(op='aten._test_autograd_multiple_dispatch', overload='ntonly')>,
 <OpOverload(op='aten._test_check_tensor', overload='default')>,
 <OpOverload(op='aten._test_serialization_subcmul', overload='default')>,
 <OpOverload(op='aten._test_string_default', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_gru_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_fused_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._to_cpu', overload='default')>,
 <OpOverload(op='aten._upsample_bicubic2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_bilinear2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='vec')>,
 <OpOverload(op='aten._use_cudnn_rnn_flatten_weight', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsr_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_compressed_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_coo_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csr_tensor_args', overload='default')>,
 <OpOverload(op='aten._version', overload='default')>,
 <OpOverload(op='aten._weight_norm', overload='default')>,
 <OpOverload(op='aten._weight_norm_differentiable_backward', overload='default')>,
 <OpOverload(op='aten.absolute', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool1d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool2d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool3d', overload='default')>,
 <OpOverload(op='aten.adaptive_max_pool1d', overload='default')>,
 <OpOverload(op='aten.affine_grid_generator_backward', overload='default')>,
 <OpOverload(op='aten.align_as', overload='default')>,
 <OpOverload(op='aten.align_tensors', overload='default')>,
 <OpOverload(op='aten.all', overload='dimname')>,
 <OpOverload(op='aten.any', overload='dimname')>,
 <OpOverload(op='aten.arccos', overload='default')>,
 <OpOverload(op='aten.arccosh', overload='default')>,
 <OpOverload(op='aten.arcsin', overload='default')>,
 <OpOverload(op='aten.arcsinh', overload='default')>,
 <OpOverload(op='aten.arctan', overload='default')>,
 <OpOverload(op='aten.arctan2', overload='default')>,
 <OpOverload(op='aten.arctanh', overload='default')>,
 <OpOverload(op='aten.argsort', overload='default')>,
 <OpOverload(op='aten.argsort', overload='dimname')>,
 <OpOverload(op='aten.argsort', overload='stable')>,
 <OpOverload(op='aten.argwhere', overload='default')>,
 <OpOverload(op='aten.atleast_1d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_2d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_3d', overload='Sequence')>,
 <OpOverload(op='aten.avg_pool1d', overload='default')>,
 <OpOverload(op='aten.bilinear', overload='default')>,
 <OpOverload(op='aten.broadcast_tensors', overload='default')>,
 <OpOverload(op='aten.can_cast', overload='default')>,
 <OpOverload(op='aten.cat', overload='names')>,
 <OpOverload(op='aten.cdist', overload='default')>,
 <OpOverload(op='aten.chain_matmul', overload='default')>,
 <OpOverload(op='aten.chalf', overload='default')>,
 <OpOverload(op='aten.choose_qparams_optimized', overload='default')>,
 <OpOverload(op='aten.clip', overload='Tensor')>,
 <OpOverload(op='aten.clip', overload='default')>,
 <OpOverload(op='aten.column_stack', overload='default')>,
 <OpOverload(op='aten.combinations', overload='default')>,
 <OpOverload(op='aten.concat', overload='default')>,
 <OpOverload(op='aten.concat', overload='names')>,
 <OpOverload(op='aten.concatenate', overload='default')>,
 <OpOverload(op='aten.concatenate', overload='names')>,
 <OpOverload(op='aten.conv1d', overload='default')>,
 <OpOverload(op='aten.conv1d', overload='padding')>,
 <OpOverload(op='aten.conv2d', overload='default')>,
 <OpOverload(op='aten.conv2d', overload='padding')>,
 <OpOverload(op='aten.conv3d', overload='default')>,
 <OpOverload(op='aten.conv3d', overload='padding')>,
 <OpOverload(op='aten.conv_tbc_backward', overload='default')>,
 <OpOverload(op='aten.conv_transpose1d', overload='default')>,
 <OpOverload(op='aten.conv_transpose2d', overload='input')>,
 <OpOverload(op='aten.conv_transpose3d', overload='input')>,
 <OpOverload(op='aten.corrcoef', overload='default')>,
 <OpOverload(op='aten.cosine_embedding_loss', overload='default')>,
 <OpOverload(op='aten.cosine_similarity', overload='default')>,
 <OpOverload(op='aten.cov', overload='default')>,
 <OpOverload(op='aten.cross', overload='default')>,
 <OpOverload(op='aten.cross_entropy_loss', overload='default')>,
 <OpOverload(op='aten.ctc_loss', overload='IntList')>,
 <OpOverload(op='aten.ctc_loss', overload='Tensor')>,
 <OpOverload(op='aten.cudnn_is_acceptable', overload='default')>,
 <OpOverload(op='aten.cummax', overload='dimname')>,
 <OpOverload(op='aten.cummaxmin_backward', overload='default')>,
 <OpOverload(op='aten.cummin', overload='dimname')>,
 <OpOverload(op='aten.cumprod', overload='dimname')>,
 <OpOverload(op='aten.cumprod_backward', overload='default')>,
 <OpOverload(op='aten.cumsum', overload='dimname')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='dx')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='x')>,
 <OpOverload(op='aten.data', overload='default')>,
 <OpOverload(op='aten.det', overload='default')>,
 <OpOverload(op='aten.diag', overload='default')>,
 <OpOverload(op='aten.diagflat', overload='default')>,
 <OpOverload(op='aten.diff', overload='default')>,
 <OpOverload(op='aten.divide', overload='Scalar')>,
 <OpOverload(op='aten.divide', overload='Scalar_mode')>,
 <OpOverload(op='aten.divide', overload='Tensor')>,
 <OpOverload(op='aten.divide', overload='Tensor_mode')>,
 <OpOverload(op='aten.dstack', overload='default')>,
 <OpOverload(op='aten.einsum', overload='default')>,
 <OpOverload(op='aten.embedding_backward', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='padding_idx')>,
 <OpOverload(op='aten.embedding_sparse_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='tensor_qparams')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_quantize_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_gemm_matrix_fp16', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='KN')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='default')>,
 <OpOverload(op='aten.fft_fft', overload='default')>,
 <OpOverload(op='aten.fft_fft2', overload='default')>,
 <OpOverload(op='aten.fft_fftn', overload='default')>,
 <OpOverload(op='aten.fft_fftshift', overload='default')>,
 <OpOverload(op='aten.fft_hfft', overload='default')>,
 <OpOverload(op='aten.fft_hfft2', overload='default')>,
 <OpOverload(op='aten.fft_hfftn', overload='default')>,
 <OpOverload(op='aten.fft_ifft', overload='default')>,
 <OpOverload(op='aten.fft_ifft2', overload='default')>,
 <OpOverload(op='aten.fft_ifftn', overload='default')>,
 <OpOverload(op='aten.fft_ifftshift', overload='default')>,
 <OpOverload(op='aten.fft_ihfft', overload='default')>,
 <OpOverload(op='aten.fft_ihfft2', overload='default')>,
 <OpOverload(op='aten.fft_ihfftn', overload='default')>,
 <OpOverload(op='aten.fft_irfft', overload='default')>,
 <OpOverload(op='aten.fft_irfft2', overload='default')>,
 <OpOverload(op='aten.fft_irfftn', overload='default')>,
 <OpOverload(op='aten.fft_rfft', overload='default')>,
 <OpOverload(op='aten.fft_rfft2', overload='default')>,
 <OpOverload(op='aten.fft_rfftn', overload='default')>,
 <OpOverload(op='aten.fix', overload='default')>,
 <OpOverload(op='aten.flatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.fliplr', overload='default')>,
 <OpOverload(op='aten.flipud', overload='default')>,
 <OpOverload(op='aten.float_power', overload='Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Tensor')>,
 <OpOverload(op='aten.frobenius_norm', overload='dim')>,
 <OpOverload(op='aten.gather', overload='dimname')>,
 <OpOverload(op='aten.gather_backward', overload='default')>,
 <OpOverload(op='aten.ger', overload='default')>,
 <OpOverload(op='aten.gradient', overload='array')>,
 <OpOverload(op='aten.gradient', overload='scalararray')>,
 <OpOverload(op='aten.gradient', overload='scalarint')>,
 <OpOverload(op='aten.gradient', overload='scalarrayarray')>,
 <OpOverload(op='aten.gradient', overload='scalarrayint')>,
 <OpOverload(op='aten.gradient', overload='tensorarray')>,
 <OpOverload(op='aten.gradient', overload='tensorarrayint')>,
 <OpOverload(op='aten.greater', overload='Scalar')>,
 <OpOverload(op='aten.greater', overload='Tensor')>,
 <OpOverload(op='aten.greater_equal', overload='Scalar')>,
 <OpOverload(op='aten.greater_equal', overload='Tensor')>,
 <OpOverload(op='aten.grid_sampler', overload='default')>,
 <OpOverload(op='aten.group_norm', overload='default')>,
 <OpOverload(op='aten.gru', overload='data')>,
 <OpOverload(op='aten.gru', overload='input')>,
 <OpOverload(op='aten.gru_cell', overload='default')>,
 <OpOverload(op='aten.hinge_embedding_loss', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='TensorList_bins')>,
 <OpOverload(op='aten.histogramdd', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='int_bins')>,
 <OpOverload(op='aten.hstack', overload='default')>,
 <OpOverload(op='aten.index_add', overload='dimname')>,
 <OpOverload(op='aten.index_copy', overload='dimname')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Scalar')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Tensor')>,
 <OpOverload(op='aten.index_select', overload='dimname')>,
 <OpOverload(op='aten.index_select_backward', overload='default')>,
 <OpOverload(op='aten.infinitely_differentiable_gelu_backward', overload='default')>,
 <OpOverload(op='aten.inner', overload='default')>,
 <OpOverload(op='aten.instance_norm', overload='default')>,
 <OpOverload(op='aten.inverse', overload='default')>,
 <OpOverload(op='aten.is_complex', overload='default')>,
 <OpOverload(op='aten.is_conj', overload='default')>,
 <OpOverload(op='aten.is_distributed', overload='default')>,
 <OpOverload(op='aten.is_floating_point', overload='default')>,
 <OpOverload(op='aten.is_inference', overload='default')>,
 <OpOverload(op='aten.is_leaf', overload='default')>,
 <OpOverload(op='aten.is_neg', overload='default')>,
 <OpOverload(op='aten.is_nonzero', overload='default')>,
 <OpOverload(op='aten.is_signed', overload='default')>,
 <OpOverload(op='aten.is_vulkan_available', overload='default')>,
 <OpOverload(op='aten.isclose', overload='default')>,
 <OpOverload(op='aten.isfinite', overload='default')>,
 <OpOverload(op='aten.isreal', overload='default')>,
 <OpOverload(op='aten.istft', overload='default')>,
 <OpOverload(op='aten.item', overload='default')>,
 <OpOverload(op='aten.kl_div', overload='default')>,
 <OpOverload(op='aten.kron', overload='default')>,
 <OpOverload(op='aten.kthvalue', overload='dimname')>,
 <OpOverload(op='aten.l1_loss', overload='default')>,
 <OpOverload(op='aten.layer_norm', overload='default')>,
 <OpOverload(op='aten.ldexp', overload='Tensor')>,
 <OpOverload(op='aten.less', overload='Scalar')>,
 <OpOverload(op='aten.less', overload='Tensor')>,
 <OpOverload(op='aten.less_equal', overload='Scalar')>,
 <OpOverload(op='aten.less_equal', overload='Tensor')>,
 <OpOverload(op='aten.linalg_cholesky', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='p_str')>,
 <OpOverload(op='aten.linalg_det', overload='default')>,
 <OpOverload(op='aten.linalg_eigh', overload='default')>,
 <OpOverload(op='aten.linalg_eigvals', overload='default')>,
 <OpOverload(op='aten.linalg_eigvalsh', overload='default')>,
 <OpOverload(op='aten.linalg_inv', overload='default')>,
 <OpOverload(op='aten.linalg_ldl_factor', overload='default')>,
 <OpOverload(op='aten.linalg_lu_factor', overload='default')>,
 <OpOverload(op='aten.linalg_matmul', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='str_ord')>,
 <OpOverload(op='aten.linalg_matrix_power', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_tensor')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='tol_tensor')>,
 <OpOverload(op='aten.linalg_multi_dot', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='ord_str')>,
 <OpOverload(op='aten.linalg_pinv', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_pinv', overload='default')>,
 <OpOverload(op='aten.linalg_pinv', overload='rcond_tensor')>,
 <OpOverload(op='aten.linalg_slogdet', overload='default')>,
 <OpOverload(op='aten.linalg_solve', overload='default')>,
 <OpOverload(op='aten.linalg_solve_ex', overload='default')>,
 <OpOverload(op='aten.linalg_svd', overload='default')>,
 <OpOverload(op='aten.linalg_svdvals', overload='default')>,
 <OpOverload(op='aten.linalg_tensorinv', overload='default')>,
 <OpOverload(op='aten.linalg_tensorsolve', overload='default')>,
 <OpOverload(op='aten.linalg_vander', overload='default')>,
 <OpOverload(op='aten.linalg_vecdot', overload='default')>,
 <OpOverload(op='aten.linear', overload='default')>,
 <OpOverload(op='aten.log_sigmoid', overload='default')>,
 <OpOverload(op='aten.log_softmax', overload='Dimname')>,
 <OpOverload(op='aten.log_softmax', overload='int')>,
 <OpOverload(op='aten.logcumsumexp', overload='dimname')>,
 <OpOverload(op='aten.logdet', overload='default')>,
 <OpOverload(op='aten.logsumexp', overload='names')>,
 <OpOverload(op='aten.lstm', overload='data')>,
 <OpOverload(op='aten.lstm', overload='input')>,
 <OpOverload(op='aten.lstm_cell', overload='default')>,
 <OpOverload(op='aten.lu_solve', overload='default')>,
 <OpOverload(op='aten.margin_ranking_loss', overload='default')>,
 <OpOverload(op='aten.masked_select_backward', overload='default')>,
 <OpOverload(op='aten.matmul', overload='default')>,
 <OpOverload(op='aten.matrix_exp', overload='default')>,
 <OpOverload(op='aten.matrix_exp_backward', overload='default')>,
 <OpOverload(op='aten.matrix_power', overload='default')>,
 <OpOverload(op='aten.max', overload='names_dim')>,
 <OpOverload(op='aten.max', overload='other')>,
 <OpOverload(op='aten.max_pool1d', overload='default')>,
 <OpOverload(op='aten.max_pool1d_with_indices', overload='default')>,
 <OpOverload(op='aten.max_pool2d', overload='default')>,
 <OpOverload(op='aten.max_pool3d', overload='default')>,
 <OpOverload(op='aten.mean', overload='names_dim')>,
 <OpOverload(op='aten.median', overload='names_dim')>,
 <OpOverload(op='aten.meshgrid', overload='default')>,
 <OpOverload(op='aten.meshgrid', overload='indexing')>,
 <OpOverload(op='aten.min', overload='names_dim')>,
 <OpOverload(op='aten.min', overload='other')>,
 <OpOverload(op='aten.mish_backward', overload='default')>,
 <OpOverload(op='aten.mode', overload='dimname')>,
 <OpOverload(op='aten.msort', overload='default')>,
 <OpOverload(op='aten.multilabel_margin_loss', overload='default')>,
 <OpOverload(op='aten.multiply', overload='Scalar')>,
 <OpOverload(op='aten.multiply', overload='Tensor')>,
 <OpOverload(op='aten.nanmean', overload='default')>,
 <OpOverload(op='aten.nanmedian', overload='names_dim')>,
 <OpOverload(op='aten.nanquantile', overload='default')>,
 <OpOverload(op='aten.nanquantile', overload='scalar')>,
 <OpOverload(op='aten.native_channel_shuffle', overload='default')>,
 <OpOverload(op='aten.negative', overload='default')>,
 <OpOverload(op='aten.nested_to_padded_tensor', overload='default')>,
 <OpOverload(op='aten.nll_loss', overload='default')>,
 <OpOverload(op='aten.nll_loss2d', overload='default')>,
 <OpOverload(op='aten.nll_loss_nd', overload='default')>,
 <OpOverload(op='aten.nonzero_numpy', overload='default')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim_dtype')>,
 <OpOverload(op='aten.norm_except_dim', overload='default')>,
 <OpOverload(op='aten.not_equal', overload='Scalar')>,
 <OpOverload(op='aten.not_equal', overload='Tensor')>,
 <OpOverload(op='aten.nuclear_norm', overload='default')>,
 <OpOverload(op='aten.nuclear_norm', overload='dim')>,
 <OpOverload(op='aten.one_hot', overload='default')>,
 <OpOverload(op='aten.orgqr', overload='default')>,
 <OpOverload(op='aten.outer', overload='default')>,
 <OpOverload(op='aten.output_nr', overload='default')>,
 <OpOverload(op='aten.pad', overload='default')>,
 <OpOverload(op='aten.pad_sequence', overload='default')>,
 <OpOverload(op='aten.pairwise_distance', overload='default')>,
 <OpOverload(op='aten.pdist', overload='default')>,
 <OpOverload(op='aten.pinverse', overload='default')>,
 <OpOverload(op='aten.poisson_nll_loss', overload='default')>,
 <OpOverload(op='aten.prelu', overload='default')>,
 <OpOverload(op='aten.prod', overload='dim_Dimname')>,
 <OpOverload(op='aten.promote_types', overload='default')>,
 <OpOverload(op='aten.qr', overload='default')>,
 <OpOverload(op='aten.quantile', overload='default')>,
 <OpOverload(op='aten.quantile', overload='scalar')>,
 <OpOverload(op='aten.quantized_gru_cell', overload='default')>,
 <OpOverload(op='aten.quantized_lstm_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.relu6', overload='default')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_Tensor')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_int')>,
 <OpOverload(op='aten.result_type', overload='Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Tensor')>,
 <OpOverload(op='aten.result_type', overload='Tensor')>,
 <OpOverload(op='aten.retains_grad', overload='default')>,
 <OpOverload(op='aten.rms_norm', overload='default')>,
 <OpOverload(op='aten.rnn_relu', overload='data')>,
 <OpOverload(op='aten.rnn_relu', overload='input')>,
 <OpOverload(op='aten.rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.rnn_tanh', overload='data')>,
 <OpOverload(op='aten.rnn_tanh', overload='input')>,
 <OpOverload(op='aten.rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.row_stack', overload='default')>,
 <OpOverload(op='aten.rrelu', overload='default')>,
 <OpOverload(op='aten.scaled_dot_product_attention', overload='default')>,
 <OpOverload(op='aten.scatter', overload='dimname_src')>,
 <OpOverload(op='aten.scatter', overload='dimname_value')>,
 <OpOverload(op='aten.scatter_add', overload='dimname')>,
 <OpOverload(op='aten.selu', overload='default')>,
 <OpOverload(op='aten.silu_backward', overload='default')>,
 <OpOverload(op='aten.size', overload='Dimname')>,
 <OpOverload(op='aten.size', overload='int')>,
 <OpOverload(op='aten.slogdet', overload='default')>,
 <OpOverload(op='aten.slow_conv3d', overload='default')>,
 <OpOverload(op='aten.smm', overload='default')>,
 <OpOverload(op='aten.softmax', overload='Dimname')>,
 <OpOverload(op='aten.softmax', overload='int')>,
 <OpOverload(op='aten.sort', overload='dimname')>,
 <OpOverload(op='aten.sort', overload='dimname_stable')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices_size')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.special_digamma', overload='default')>,
 <OpOverload(op='aten.special_erf', overload='default')>,
 <OpOverload(op='aten.special_erfc', overload='default')>,
 <OpOverload(op='aten.special_erfinv', overload='default')>,
 <OpOverload(op='aten.special_exp2', overload='default')>,
 <OpOverload(op='aten.special_expit', overload='default')>,
 <OpOverload(op='aten.special_expm1', overload='default')>,
 <OpOverload(op='aten.special_gammainc', overload='default')>,
 <OpOverload(op='aten.special_gammaincc', overload='default')>,
 <OpOverload(op='aten.special_gammaln', overload='default')>,
 <OpOverload(op='aten.special_i0', overload='default')>,
 <OpOverload(op='aten.special_log1p', overload='default')>,
 <OpOverload(op='aten.special_log_softmax', overload='default')>,
 <OpOverload(op='aten.special_logit', overload='default')>,
 <OpOverload(op='aten.special_logsumexp', overload='default')>,
 <OpOverload(op='aten.special_multigammaln', overload='default')>,
 <OpOverload(op='aten.special_ndtr', overload='default')>,
 <OpOverload(op='aten.special_polygamma', overload='default')>,
 <OpOverload(op='aten.special_psi', overload='default')>,
 <OpOverload(op='aten.special_round', overload='default')>,
 <OpOverload(op='aten.special_sinc', overload='default')>,
 <OpOverload(op='aten.special_softmax', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='other_scalar')>,
 <OpOverload(op='aten.special_xlogy', overload='self_scalar')>,
 <OpOverload(op='aten.square', overload='default')>,
 <OpOverload(op='aten.sspaddmm', overload='default')>,
 <OpOverload(op='aten.std', overload='correction_names')>,
 <OpOverload(op='aten.std', overload='default')>,
 <OpOverload(op='aten.std', overload='dim')>,
 <OpOverload(op='aten.std', overload='names_dim')>,
 <OpOverload(op='aten.std_mean', overload='correction_names')>,
 <OpOverload(op='aten.std_mean', overload='default')>,
 <OpOverload(op='aten.std_mean', overload='dim')>,
 <OpOverload(op='aten.std_mean', overload='names_dim')>,
 <OpOverload(op='aten.stft', overload='center')>,
 <OpOverload(op='aten.stft', overload='default')>,
 <OpOverload(op='aten.stride', overload='Dimname')>,
 <OpOverload(op='aten.stride', overload='int')>,
 <OpOverload(op='aten.subtract', overload='Scalar')>,
 <OpOverload(op='aten.subtract', overload='Tensor')>,
 <OpOverload(op='aten.sum', overload='dim_DimnameList')>,
 <OpOverload(op='aten.sum_to_size', overload='default')>,
 <OpOverload(op='aten.svd', overload='default')>,
 <OpOverload(op='aten.sym_size', overload='int')>,
 <OpOverload(op='aten.sym_stride', overload='int')>,
 <OpOverload(op='aten.take_along_dim', overload='default')>,
 <OpOverload(op='aten.tensordot', overload='default')>,
 <OpOverload(op='aten.thnn_conv2d', overload='default')>,
 <OpOverload(op='aten.tile', overload='default')>,
 <OpOverload(op='aten.to_dense', overload='default')>,
 <OpOverload(op='aten.to_dense_backward', overload='default')>,
 <OpOverload(op='aten.to_mkldnn_backward', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='sparse_dim')>,
 <OpOverload(op='aten.to_sparse_bsc', overload='default')>,
 <OpOverload(op='aten.to_sparse_bsr', overload='default')>,
 <OpOverload(op='aten.to_sparse_csc', overload='default')>,
 <OpOverload(op='aten.to_sparse_csr', overload='default')>,
 <OpOverload(op='aten.trace_backward', overload='default')>,
 <OpOverload(op='aten.trapezoid', overload='dx')>,
 <OpOverload(op='aten.trapezoid', overload='x')>,
 <OpOverload(op='aten.trapz', overload='dx')>,
 <OpOverload(op='aten.trapz', overload='x')>,
 <OpOverload(op='aten.triplet_margin_loss', overload='default')>,
 <OpOverload(op='aten.true_divide', overload='Scalar')>,
 <OpOverload(op='aten.true_divide', overload='Tensor')>,
 <OpOverload(op='aten.type_as', overload='default')>,
 <OpOverload(op='aten.unflatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.upsample_bicubic2d', overload='vec')>,
 <OpOverload(op='aten.upsample_bilinear2d', overload='vec')>,
 <OpOverload(op='aten.upsample_linear1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='vec')>,
 <OpOverload(op='aten.upsample_trilinear3d', overload='vec')>,
 <OpOverload(op='aten.value_selecting_reduction_backward', overload='default')>,
 <OpOverload(op='aten.vander', overload='default')>,
 <OpOverload(op='aten.var', overload='correction_names')>,
 <OpOverload(op='aten.var', overload='default')>,
 <OpOverload(op='aten.var', overload='dim')>,
 <OpOverload(op='aten.var', overload='names_dim')>,
 <OpOverload(op='aten.var_mean', overload='correction_names')>,
 <OpOverload(op='aten.var_mean', overload='default')>,
 <OpOverload(op='aten.var_mean', overload='dim')>,
 <OpOverload(op='aten.var_mean', overload='names_dim')>,
 <OpOverload(op='aten.vstack', overload='default')>,
 <OpOverload(op='aten.where', overload='Scalar')>,
 <OpOverload(op='aten.where', overload='ScalarOther')>,
 <OpOverload(op='aten.where', overload='ScalarSelf')>,
 <OpOverload(op='aten.where', overload='default')>,
 <OpOverload(op='aten.wrapped_linear_prepack', overload='default')>,
 <OpOverload(op='aten.wrapped_quantized_linear_prepacked', overload='default')>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136153
Approved by: https://github.com/xadupre, https://github.com/gramalingam
2024-09-16 21:28:54 +00:00
b76d1b79e6 Add scaling arguments to bsr_dense_addmm (#136104)
As in the title.

Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413

The PR assumes that the existing tuning parameters are also good when using scaling arguments. This needs to be verified as a follow-up task.

Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This now allows zero strides, which previously triggered a `contiguous` call even though the underlying memory buffer was contiguous.

Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to the code (torch or triton?) that implements the element/chunk-wise copy, so that we can verify that allowing zero strides would indeed not trigger element-wise copies. At the moment, the performance increase in ViT-H benchmarks (which involve zero strides) is evidence that allowing zero strides does not lead to slow-downs.
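For reference, a dense sketch of the usual addmm-style scaling convention that the new arguments presumably follow, i.e. `out = beta * input + alpha * (mat1 @ mat2)`; the `bsr_dense_addmm` signature itself is internal and not reproduced here:

```python
import torch

inp = torch.randn(4, 4)
m1, m2 = torch.randn(4, 8), torch.randn(8, 4)
ref = torch.addmm(inp, m1, m2, beta=0.5, alpha=2.0)
assert torch.allclose(ref, 0.5 * inp + 2.0 * (m1 @ m2))
```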

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
2024-09-16 20:26:54 +00:00
bfbcdf4967 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit c64ae601ba9eb3ad2cd3402a14f6ac83c0ab7eba.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, we need to skip the new tests on py3.10 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2353909010))
2024-09-16 20:26:35 +00:00
3c97b0ab00 Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499)
NCCL does not have an API for ncclAllToAll and ncclAllToAllv, so PyTorch implements all-to-all with point-to-point send/recv. This PR uses the native APIs when they are supported.

Differential Revision: [D61683836](https://our.internmc.facebook.com/intern/diff/D61683836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134499
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-09-16 20:08:06 +00:00
abd16a8c64 [torch/multiprocessing] Use multiprocessing.reduction.register instead of ForkingPickler.register to register custom tensor and storage reductions (#135030)
Right now `multiprocessing.reduction.register()` is simply an alias to `multiprocessing.reduction.ForkingPickler.register()`
(https://github.com/python/cpython/blame/main/Lib/multiprocessing/reduction.py#L56), but the top-level `register()` function exposes less of the internal details of the `multiprocessing.reduction` module.
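A minimal sketch of the top-level helper (the custom class and reducer below are illustrative):

```python
from multiprocessing import reduction

class Handle:
    def __init__(self, fd: int):
        self.fd = fd

def rebuild_handle(fd: int) -> Handle:
    return Handle(fd)

def reduce_handle(h: Handle):
    return (rebuild_handle, (h.fd,))

# Equivalent to reduction.ForkingPickler.register(Handle, reduce_handle),
# but without reaching into ForkingPickler directly.
reduction.register(Handle, reduce_handle)
```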
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135030
Approved by: https://github.com/albanD
2024-09-16 20:07:29 +00:00
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the Option from ProcessGroup and asks users to either set the timeout or backend later on, or to create the backend directly after creating a PG.

Also, PGNCCL is using the Options class from ProcessGroup, but we should actually use the Options from the backend class. So this PR aligns the type and name with what we are doing on the C++ side. I don't change the signature of the public API, so it still uses args named "pg_options".

We need to make changes to the tests to align them with this change.
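A hedged sketch of the resulting usage (assumes a CUDA/NCCL build and an env:// rendezvous already configured): the public API keeps the `pg_options` name, which now carries backend-level Options.

```python
import torch.distributed as dist

opts = dist.ProcessGroupNCCL.Options()
opts.is_high_priority_stream = True

# Timeout/backend can also be set later or on backend creation, per this change.
dist.init_process_group(backend="nccl", pg_options=opts)
```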

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
7537f74277 Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491)
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to look up the key in the cache
- post compile, where we do some logging and run post compile steps

Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.
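A rough sketch of the three-phase shape described above (names are illustrative, not the real FxGraphCache signatures):

```python
from typing import Any, Callable, Optional

def cached_compile(
    prepare_key: Callable[[Any], Optional[str]],
    load_with_key: Callable[[str], Optional[Any]],
    post_compile: Callable[[Any, bool], Any],
    compile_fn: Callable[[Any], Any],
    inp: Any,
) -> Any:
    key = prepare_key(inp)            # phase 1: bypass if the input isn't cacheable
    if key is None:
        return compile_fn(inp)
    cached = load_with_key(key)       # phase 2: cache lookup (local/remote)
    hit = cached is not None
    result = cached if hit else compile_fn(inp)
    return post_compile(result, hit)  # phase 3: logging + post-compile steps
```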

Differential Revision: D62314862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
2024-09-16 19:48:08 +00:00
31715be72a [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.2 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-16 19:44:11 +00:00
38caf10411 [EZ] Fix spelling typo (#136157)
s/toosl/tools/ (spotted by @louie-tsai)
Also, capitalize CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157
Approved by: https://github.com/kit1980
2024-09-16 19:30:30 +00:00
c977bb7d03 [Distributed] fix FileSystemWriter __init__ (#136135)
Fixes #135608.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135
Approved by: https://github.com/Skylion007
2024-09-16 19:11:08 +00:00
717fca2cac Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146)
Fixes #125920

[Running clang-tidy](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#running-clang-tidy) section is misleading and outdated. C++ lint is done with lintrunner and covered in [local-linting](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#local-linting) section.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136146
Approved by: https://github.com/janeyx99
2024-09-16 19:02:21 +00:00
f89ce4dfbb torch.nn.MultiheadAttention: docs: improvement (#136111)
`torch.nn.MultiheadAttention`: docs: improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136111
Approved by: https://github.com/janeyx99
2024-09-16 18:52:20 +00:00
d3647d15e6 Remove accidentally committed code (#136154)
Accidentally left out during rebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136154
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-09-16 18:34:20 +00:00
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67db15ec17b2b861a09667bf59927e033.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
5193f23469 [Pytorch] Cleanup Strobelight URL and shorten for readability (#136102)
Summary:
- Converted the Strobelight URL prefix to more readable and editable JSON
- Dump shortened URLs when possible for easier readability

Test Plan:
```
python ./torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py
```

Differential Revision: D62690292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136102
Approved by: https://github.com/laithsakka
2024-09-16 18:10:33 +00:00
0199fd4d7e Revert "[inductor] More fixes on the keys of constants and signature dictionaries (#135406)"
This reverts commit e54b559e8860e343692bb5534777b2384a57a613.

Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))
2024-09-16 17:58:02 +00:00
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add isin support for half dtypes on CPU (just a few extra dispatches).
* The CUDA implementation for bfloat16 was apparently mostly compiled and available all along (it just calls sort and unique internally). To enable it, we only need to remove an assert (sort's functionality has been updated since the assert was added) and add the missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for these dtypes for parity.
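A quick check of the newly covered dtypes (CPU shown; per the notes above, bfloat16 now also works on CUDA):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
test = torch.tensor([2.0, 5.0], dtype=torch.float16)
print(torch.isin(x, test))   # tensor([False,  True, False])
print(torch.unique(x))       # tensor([1., 2., 3.], dtype=torch.float16)
```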

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
0aa41eb52f [ONNX] Run type promotion test in CI and update the table (#135915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135915
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-16 16:46:13 +00:00
090046b936 [effects] Turn off dtype promotion for with_effects lowering (#136039)
By default Inductor promotes arguments to the common highest dtype.
Having an empty token with dtype=torch.float32 results in dtype promotion of effectful ops during the lowering of with_effects.

Disabling dtype promotion for this lowering.

Removing the previous workaround that made the token dtype torch.bool.

Testing:

```
python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039
Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519
2024-09-16 16:14:05 +00:00
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
13bd1256f9 Delete stable prototype (#135911)
This project ended up going in an entirely different direction, so we can close out all this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135911
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2024-09-16 15:32:17 +00:00
d833f49602 [reland][Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#136046)
Summary: Reland https://github.com/pytorch/pytorch/pull/135313 after fixing internal build issues

Test Plan: CI

Differential Revision: D62658837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136046
Approved by: https://github.com/chenyang78, https://github.com/etaf, https://github.com/jansel
2024-09-16 14:35:19 +00:00
a803cb0531 [AOTI] Refactor how cpp_wrapper specific options are set (#136035)
Summary:
1) When cpp-wrapper is turned on, certain Triton-specific options need to be set, both for forward and backward. This PR consolidates the settings in one place.
2) Change config.triton.autotune_at_compile_time to default to None. If the flag is not explicitly set by the user, default it to True for cpp-wrapper.
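An illustrative snippet of the flags involved (values shown are for demonstration; with cpp-wrapper on and the flag left unset, it now effectively defaults to True):

```python
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True
# Leaving this as None lets cpp-wrapper resolve it to True at compile time;
# setting it explicitly (True/False) always wins.
print(inductor_config.triton.autotune_at_compile_time)
```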

Differential Revision: [D62689940](https://our.internmc.facebook.com/intern/diff/D62689940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136035
Approved by: https://github.com/chenyang78
2024-09-16 14:32:13 +00:00
bbc3fdbbde Add python 3.13.0t build to Docker images (#136001)
Adds 3.13t python to Docker images
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136001
Approved by: https://github.com/albanD
2024-09-16 12:49:36 +00:00
3117f2cf67 Revert "[BE]: Update mypy to 1.11.2 (#133816)"
This reverts commit 55299cfc223fa838aadd8d6d6fa3ed541fa5acd1.

Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))
2024-09-16 09:11:16 +00:00
951c21d679 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133778
2024-09-16 04:53:06 +00:00
9961aaa601 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-16 04:53:06 +00:00
d2207c57f7 [Distributed] add pack-check method for float8_e5m2 (#136115)
Add support for Float8_e5m2, following a similar algorithm to the one used for Float8_e4m3fn (i.e. an overflow check).

Made `HasNanFP8x8` a template so that it is extendable based on dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136115
Approved by: https://github.com/Skylion007
ghstack dependencies: #135891, #135961
2024-09-15 21:37:43 +00:00
e501ed71d4 Update link in distributed.tensor.parallel.rst (#136103)
dtensor folder was moved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136103
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-09-15 19:36:29 +00:00
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
a141c6bb0d [pytorch][monitoring] Dynamic backend for WaitCounter (#135967)
Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application.

Differential Revision: D62549295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967
Approved by: https://github.com/c-p-i-o
2024-09-15 18:07:49 +00:00
dec3403b24 Add some doc for export_for_training (#135918)
Differential Revision: [D62610491](https://our.internmc.facebook.com/intern/diff/D62610491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135918
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080, #135912
2024-09-15 17:08:12 +00:00
1904b09e61 Create export_for_inference API and expose core_aten as public facing API (#135912)
Differential Revision: [D62606908](https://our.internmc.facebook.com/intern/diff/D62606908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135912
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080
2024-09-15 17:05:07 +00:00
382fad58b3 Deprecate _preserve_ops and consolidate with decomp_table (#135080)
In this PR, we deprecate _preserve_ops feature in run_decomposition API. We can't kill this API completely because Executorch team depends on it. As the syncing between two repos is non-trivial, I just leave this argument as deprecated for now. In the next PR, i will immediately remove it.

After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. Old code path is protected under IS_FBCODE flag.
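
A minimal usage sketch of the post-change behavior described above (an empty table decomposes nothing, so every op is preserved); assumes the OSS code path:

```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.lin(x))

ep = export(M(), (torch.randn(2, 4),))
# Only ops listed in decomp_table are decomposed; everything else is preserved.
ep = ep.run_decompositions(decomp_table={})
print(ep.graph_module.code)
```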

Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh
2024-09-15 17:01:58 +00:00
357b7fb579 Revert "[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)"
This reverts commit b8637503c036abb898f6b880b325aeffe6f09c03.

Reverted https://github.com/pytorch/pytorch/pull/135953 on behalf of https://github.com/kollasb due to Broke internal module factory compatibility, revert from Phabricator failed ([comment](https://github.com/pytorch/pytorch/pull/135953#issuecomment-2351381777))
2024-09-15 05:32:38 +00:00
cyy
31e42a45dd Fix redundant move warnings by g++ (#134987)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134987
Approved by: https://github.com/ezyang
2024-09-15 05:28:19 +00:00
e1abd346a3 [audio hash update] update the pinned audio hash (#136106)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136106
Approved by: https://github.com/pytorchbot
2024-09-15 04:31:35 +00:00
386884e553 [Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Support FSDP2 + AC (#134997)
> Ignore FSDP2 forward hook side-effects in AC

Under AC, FSDP2 does not rely on forward hook to all-gather weights to do recomputation, instead it relies on pre-backward hook to do this job:
451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)

So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about FSDP2 forward hook's side effects and can safely ignore it, because we are not and we don't expect to re-run the FSDP2 forward hook during backward recomputation.

----

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997
Approved by: https://github.com/zou3519
ghstack dependencies: #135727
2024-09-15 02:00:17 +00:00
8072ebc36c SKIP llama for dynamic size testing (#135960)
Running Torchbench llama with dynamic size failed with
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip this model for marking dynamic dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
2024-09-15 00:06:49 +00:00
a1a57a424d Optimize dict reconstruct to not codegen untouched values (#134876)
PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
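
Roughly, the emitted bytecode now behaves like the following Python sketch (a hedged rendering of the idea, not the actual Dynamo codegen):

```python
def reconstruct_dict(original, changed_items, removed_any_key):
    # changed_items: only the key/value pairs Dynamo saw being mutated
    if removed_any_key:
        # clearing is only needed when a key was deleted during tracing
        original.clear()
    original.update(changed_items)
```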

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-14 23:25:28 +00:00
a5eb43d8b4 Add TensorReferenceAnalysis and some tests (#135886)
Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs, e.g. sometimes we need to use torch.ops.aten.{operator}.Tensor while other times we need torch.ops.aten.{operator}.default, or in the case of pow we need Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis, add some tests, and do the actual integration in a separate diff.
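
For context, a small example of the overload distinction mentioned above (these overloads exist in the aten namespace):

```python
import torch

a, b = torch.randn(3), torch.randn(3)
torch.ops.aten.add.Tensor(a, b)         # some ops need the .Tensor overload
torch.ops.aten.relu.default(a)          # others only expose .default
torch.ops.aten.pow.Tensor_Tensor(a, b)  # pow with two tensors uses .Tensor_Tensor
```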

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886
Approved by: https://github.com/ezyang
2024-09-14 23:09:40 +00:00
391f2d6d50 use a fast expand algorithm (#135999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135999
Approved by: https://github.com/ezyang
2024-09-14 23:09:34 +00:00
5b21d91197 Fix dividing Mul by factor (#136079)
Fixes https://github.com/pytorch/pytorch/issues/136032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136079
Approved by: https://github.com/ezyang
2024-09-14 22:14:27 +00:00
426580a67d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel
ghstack dependencies: #133408
2024-09-14 21:45:19 +00:00
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
55299cfc22 [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.2 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-14 21:40:36 +00:00
c64ae601ba [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451
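
A hypothetical repro of the pattern in question (chained classmethod/property is only allowed on Python 3.9-3.12, and the eager backend here is just to keep the example light):

```python
import torch

class Cfg:
    _scale = 2.0

    @classmethod
    @property
    def scale(cls):  # classmethod(property(...)): reading Cfg.scale returns 2.0
        return cls._scale

@torch.compile(backend="eager")
def f(x):
    return x * Cfg.scale

f(torch.randn(3))
```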

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-14 21:00:41 +00:00
7f5abb44af [BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087)
Updates pybind11 submodule. The major patch note is an experimental new function, cpp_conduit, added to all pybind11 objects that will make them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes needed on our end except to update the submodule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
2024-09-14 20:48:44 +00:00
8df01c8258 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 18:52:22 +00:00
860838e9be [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 18:52:22 +00:00
1b9daeb240 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of `with` contexts for TorchFunction modes that have the default enter/exit behavior (i.e. pushing/popping the mode).

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
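
A minimal sketch of the kind of program this enables (whether a given custom mode compiles without graph breaks depends on what its `__torch_function__` does):

```python
import torch
from torch.overrides import TorchFunctionMode

class PassthroughMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        return func(*args, **kwargs)

@torch.compile
def f(x):
    # the enter/exit of the mode is what this PR teaches Dynamo to handle
    with PassthroughMode():
        return x.sin() + 1

f(torch.randn(4))
```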

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 18:52:22 +00:00
06caa2d560 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 18:52:22 +00:00
14cabdf626 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread-local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, because calling `object.__setattr__` on these objects results in a type error.
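
A hypothetical example of the kind of mutation this covers (with `backend="eager"` the snippet runs whether or not Dynamo traces it without a graph break):

```python
import threading
import torch

_tls = threading.local()

@torch.compile(backend="eager")
def f(x):
    _tls.last_shape = x.shape  # setattr on a thread-local object, replayed by Dynamo
    return x + 1

f(torch.randn(2, 3))
```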

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 18:52:22 +00:00
5c5c33ac32 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 18:52:22 +00:00
228760b945 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 18:52:22 +00:00
b4c84c3167 [AOTI] Fix a fallback op returning None issue (#135997)
Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor.

Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997
Approved by: https://github.com/chenyang78
2024-09-14 18:12:06 +00:00
b82122beef Only keep ListOfLinears module in basic_modules_benchmarks and add gpu version. (#135730)
All of the previous benchmarks are similar; ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without a specific intention; I was just trying to create a large number of benchmarks to better observe noise.

This PR keeps only one; we can add more as we see value and regressions in the future.
This diff also adds a GPU version.
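
For reference, a hypothetical reconstruction of what a ListOfLinears benchmark module looks like (the real definition lives in the benchmark suite; sizes here are made up):

```python
import torch
import torch.nn as nn

class ListOfLinears(nn.Module):
    def __init__(self, num_layers: int = 10, dim: int = 64):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

torch.compile(ListOfLinears())(torch.randn(8, 64))
```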
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 16:45:52 +00:00
b8637503c0 [Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)
Summary:
Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that.

- Created a new function to abstract out compile time profiling enablement. This function allows profiler to switch between different function profilers (e.g. Thrift based or CLI based)
- Both OSS and Fbcode now use one compile time profiler in torch/_strobelight

Test Plan:
Tested OSS with following commands:
```
python torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py

TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
```

See test commands for fbcode in comments.

Differential Revision: D62444551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953
Approved by: https://github.com/laithsakka
2024-09-14 16:35:22 +00:00
f97cccf62a [3.13] fix 3.13 pickle error in torch/package (#136049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136049
Approved by: https://github.com/albanD
ghstack dependencies: #136034
2024-09-14 14:28:09 +00:00
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
23dec79cef Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 731b178b56c83966d6e8cdfb0015d22d8f91b4d2.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
8c8a3086a7 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 4528777e034b157a8329d1879daf52290eea199a.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
46f5037007 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 149d0b716173787df4543186ff74b605aca54e3e.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
7975ec3a29 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit ce3c74f2744cbc134b95cf8bd53ae5e3fbc67c29.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
f3180f0088 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
838c912502 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 5c67cf180ee53d696f95d7c45dd99a35399e4450.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
72b868d034 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit e77bd0ebd20e96990ccd40518e68bbcfe7fda855.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:54 +00:00
41b58a1bec OpenReg: Fix issue when copying on the same device (#135956)
The current copy gets the wrong value when src and dst are both openreg devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135956
Approved by: https://github.com/albanD
2024-09-14 09:57:45 +00:00
f96a073c9d Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 09:53:17 +00:00
a815611db9 [Traceable FSDP2][Partitioner] Must save AC output if output has a backward hook (#135727)
If a node is an AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`.
2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed.
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad`  -> circular dependency with (1)!

Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.

----

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
2024-09-14 08:45:58 +00:00
3352c9ac94 Add higher order operator name to the cache bypass exception (#135876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135876
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2024-09-14 07:05:29 +00:00
5a2be192d1 [Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824)
During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such a mixed usage pattern is not supported by compiled autograd. Here we try to catch and throw an error for such a usage pattern, so that the user can fix the usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824
Approved by: https://github.com/awgu
2024-09-14 06:30:12 +00:00
a9bef85263 [CI] Increase open file handles limit to 16K on MacOS (#136061)
Maybe it will help with flaky failures tracked in https://github.com/pytorch/pytorch/issues/135885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136061
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn, https://github.com/ZainRizvi
2024-09-14 06:16:12 +00:00
44dd218a61 Disable garbage collection during compile_time_instructions count in benchmark base by default. (#135768)
When we measure compile time instruction count, in most cases we probably do not want to measure GC instructions, so GC is disabled here by default.
If it is needed, we can add an option to allow it, or one can use the regular total instruction count instead of the compile time instruction count.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 06:15:28 +00:00
1a67e2b680 [MPS] Add native im2col (#135706)
It's called from `torch.unfold` and is one of the few remaining vestiges in `MPSFallback.mm`.

Strongly inspired by CUDA implementation from 09519eb195/aten/src/ATen/native/cuda/im2col.cuh (L40-L61)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135706
Approved by: https://github.com/albanD
2024-09-14 06:09:36 +00:00
b9b6094793 [ROCm] Skip pointwise associative scan tests due to regression (#135995)
https://github.com/pytorch/pytorch/pull/133012 caused a regression on ROCm causing pointwise scan tests to fail

```
ERROR: test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_tuple_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda
```

Skipping temporarily while triage is underway.

Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/30067645445

```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/graph.py", line 1020, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 363, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 6245, in associative_scan
    raise RuntimeError("Unable to generate code for associative_scan op")
torch._inductor.exc.LoweringException: RuntimeError: Unable to generate code for associative_scan op
```

NOTE: even "eager" backend fails
```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_higher_order_ops/associative_scan.py", line 338, in associative_scan_op_dense
    raise NotImplementedError("associative_scan is not implemented for eager")
NotImplementedError: associative_scan is not implemented for eager
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135995
Approved by: https://github.com/malfet
2024-09-14 05:40:10 +00:00
911a43f930 [TCPStore] Remove deprecated constructor (#136004)
While looking at the TCPStore code again, I found it confusing that we still keep the deprecated constructor for TCPStore in cpp while we no longer expose it in python via pybind. I checked both internal and external usage; all use cases in cpp (aside from the unit test fixed in this PR) have already moved to the options-based constructor. So let's remove this legacy constructor to avoid confusion.

Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-09-14 04:25:47 +00:00
e77bd0ebd2 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 02:41:16 +00:00
5c67cf180e [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 02:41:16 +00:00
7743149b2b [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of `with` contexts for TorchFunction modes that have the default enter/exit behavior (i.e. pushing/popping the mode).

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 02:41:08 +00:00
ce3c74f274 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 02:40:59 +00:00
149d0b7161 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread-local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, because calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 02:40:52 +00:00
4528777e03 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 02:40:43 +00:00
731b178b56 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 02:40:32 +00:00
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c52061339069a2162e921e5b464fad5a411522.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
2e8d431a8f Fix tensor.data_ptr() representation overflow (#135567)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135550
In PyTorch, [`tensor.data_ptr()`](e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)) is reinterpreted by a [signed int64](e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)) data type, which could result in an **overflow issue**, like below:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
-23453392437248
# this is inconsistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```
This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)). With this PR, the output will become:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
18446720620317114368
# this is consistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```

# Solution
Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantics of `wrap`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD
2024-09-14 01:52:04 +00:00
95496e4855 [CI] Check that PyTorch is built with OpenMP (#136060)
The restriction to x86-only builds should have been removed a long time ago

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136060
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/ZainRizvi
2024-09-14 01:51:36 +00:00
5de4cb8cd8 [Inductor UT] Generalize inductor UT for intel GPU (Part 3) (#135827)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_autograd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135827
Approved by: https://github.com/etaf, https://github.com/desertfire
2024-09-14 01:43:05 +00:00
06bc717410 Fix sum() forward for NJT (#131945)
This PR solves two problems with `sum()` support in NJT:
* `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519.
* Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails).
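
A hedged sketch of the two call patterns affected (exact dim coverage for the jagged layout may vary by build):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
per_row = nt.sum(dim=2, keepdim=True)  # dim-wise reduction; keepdim now keeps the reduced dim
total = nt.sum()                       # full reduction (forward only, per the note above)
```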

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945
Approved by: https://github.com/davidberard98, https://github.com/jananisriram
2024-09-14 00:58:03 +00:00
081c4a966d [BE] Use squeeze/unsqueeze in im2col (#136006)
And move unsqueeze out of the dispatch, as it's dtype-agnostic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136006
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-14 00:35:37 +00:00
4237592b8f [Distributed] add pack-check method for float8_e4m3fn (#135961)
We check 8 x FP8 values simultaneously, i.e. 8 bytes at a time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135961
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #135891
2024-09-14 00:32:27 +00:00
a00faf4408 [3.13] fix 3.13 pickle error in serialization.py (#136034)
Error encountered when adding dynamo 3.13 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136034
Approved by: https://github.com/albanD
2024-09-14 00:02:40 +00:00
b608ff3bea [Easy] Dont match to mm_plus_mm if not in max autotune (#135929)
It's only an optimization when we tune the triton template.
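
For context, the pattern in question and the mode where matching it pays off (a small illustrative sketch):

```python
import torch

def f(a, b, c, d):
    return a @ b + c @ d  # the mm_plus_mm pattern

a, b, c, d = (torch.randn(64, 64) for _ in range(4))
# The fused template is only worth matching when Triton templates are autotuned:
out = torch.compile(f, mode="max-autotune")(a, b, c, d)
```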

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135929
Approved by: https://github.com/FindHao
2024-09-13 23:38:02 +00:00
b8eef500a6 Fix attr check for quantization spec (#135736)
Summary:
Previously we only checked dtype and is_dynamic to decide if two quantization specs are equivalent.
This may not work in some cases, e.g. when people use a different qscheme or different quant_min/quant_max values.

This PR adds checks for the other fields as well.
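
A hedged illustration using the PT2E quantizer's QuantizationSpec: the two specs below share dtype/is_dynamic but differ in qscheme, so they should no longer be treated as equivalent (import paths as in torch.ao.quantization.quantizer):

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver
from torch.ao.quantization.quantizer import QuantizationSpec

spec_a = QuantizationSpec(
    dtype=torch.int8, quant_min=-128, quant_max=127,
    qscheme=torch.per_tensor_affine,
    observer_or_fake_quant_ctr=MinMaxObserver,
)
spec_b = QuantizationSpec(
    dtype=torch.int8, quant_min=-128, quant_max=127,
    qscheme=torch.per_tensor_symmetric,
    observer_or_fake_quant_ctr=MinMaxObserver,
)
```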

Test Plan:
regression tests

Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736
Approved by: https://github.com/sxu
2024-09-13 23:01:22 +00:00
aad556a0b5 [PT2][Inductor][Optimus] Fix a corner case in remove_split_with_size_one (#135962)
Summary: see context in https://fb.workplace.com/groups/1075192433118967/permalink/1501768230461383/

Test Plan:
# local reproduce
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "mai" --flow_id 642153776
```
P1586356950

# e2e

before fix

f642153776

after fix

Differential Revision: D62625318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135962
Approved by: https://github.com/jackiexu1992
2024-09-13 22:53:08 +00:00
3c5d44dda5 Cleanup unused runner variants (#136058)
Cleaning up unused runner variants, leaving behind only the few that are actually referenced by workflows

For more details see description in the PR that generated these code changes:
- https://github.com/pytorch/test-infra/pull/5665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136058
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-09-13 22:50:07 +00:00
e2d3af405f [ONNX] Remove logging apis from public (#133825)
Remove

- torch.onnx.enable_log
- torch.onnx.disable_log
- torch.onnx.set_log_stream
- torch.onnx.log

Because they are not meant for public consumption and have been marked for deprecation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133825
Approved by: https://github.com/titaiwangms
2024-09-13 22:19:52 +00:00
baff86dafb [MTIA tensor] allow shallow copy between CPU and MTIA tensors (#135871)
Reviewed By: egienvalue, hanzlfs

Differential Revision: D61662214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135871
Approved by: https://github.com/egienvalue, https://github.com/nautsimon
2024-09-13 22:13:58 +00:00
db5e1b44d2 Fix inductor-micro-benchmark results upload (take 2) (#136052)
I had a brain freeze when I wrote the original fix.  The parameters were in the wrong order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136052
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet
2024-09-13 22:05:10 +00:00
a30d5ba16c Fix bug in split-build workflows codegen (#136043)
By just deleting a few rogue lines left over from https://github.com/pytorch/pytorch/pull/135510
If a file in the workflows folder does not have a `.yml` extension, it will not be launched at all, will it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136043
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-13 21:29:06 +00:00
46935c8241 Reduce default iterations to 5 . (#135773)
Running all benchmarks currently takes around 15 mins; this is the data:
https://www.internalfb.com/phabricator/paste/view/P1583590240
The data looks mostly stable, and 5 iterations should be good, especially with our 1.5% threshold.
That said, the diff also adds a way to increase the number of iterations for a specific benchmark.

Results after the change:
https://www.internalfb.com/phabricator/paste/view/P1583618969
Time is down to half (7 mins).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 21:16:38 +00:00
4f407c1884 Only measure compile time instruction count for sum_floordiv benchmark (#135785)
There was some recent strange noise of +5%/-5%.
Using only compile time:
1) avoids GC time.
2) avoids other operations that are not what we are trying to measure here. ==> less probable noise.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
2024-09-13 21:14:10 +00:00
2e461e54e8 Add gpu and gpu_dynamic versions of add_loop (#135809)
I am thinking maybe 3 iterations are enough for this one?
- I am keeping eager and inductor, since inductor is 2X eager time.
- Eager dynamic is 2X eager, so keeping this as well.
- Inductor has three tests (dynamic gpu, gpu and cpu).
I am unsure if I am over-profiling here; happy to trim if anyone has suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 20:42:31 +00:00
a3d827a28c Use python 3.11 for Large Wheel build (#136042)
Use Python 3.11 in nightly Large wheel builds. Required for Colab testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136042
Approved by: https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Sergii Dymchenko <kit1980@gmail.com>
2024-09-13 20:27:11 +00:00
4312794b92 [reland][export] fix re-export custom metadata (#135720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/134778

The previous D62304294 broke some executorch tests. It has already been reverted.

In this diff, `_collect_param_buffer_metadata()` is modified so that when a `call_function` node is encountered and its input nodes include `get_attr`, we skip the fields that have been collected previously and only collect the rest of the fields. This prevents over-writing.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle
```

Differential Revision: D62514208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720
Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168
2024-09-13 20:15:15 +00:00
b856f3539b Fix script name in the comments (#135507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135507
Approved by: https://github.com/atalman
2024-09-13 19:59:47 +00:00
835e7bb077 fix requirements.txt installation failure issue on Windows (#134567)
Fixes #134564

Root cause:

The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports Windows 32bit and Linux 64bit. Since compilation of pytorch requires a 64bit env, on Windows `lintrunner` has to be compiled from the source distribution. `Rust` is its dependency for compilation, as indicated in the error message. Meanwhile, a Visual Studio environment is needed for linking libraries.

![image](https://github.com/user-attachments/assets/180cd899-8886-43b5-b42f-031f41e81683)

Issue when performing `pip install lintrunner` without a Visual Studio environment activated is shown below.

```bash
>python -m pip install lintrunner
Collecting lintrunner
  Downloading lintrunner-0.12.5.tar.gz (62 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: lintrunner
  Building wheel for lintrunner (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for lintrunner (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [137 lines of output]
      Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off`
      📡 Using build options bindings from pyproject.toml
         Compiling proc-macro2 v1.0.79
         Compiling unicode-ident v1.0.12
         Compiling version_check v0.9.4
         Compiling windows_x86_64_msvc v0.52.4
         Compiling winapi v0.3.9
         Compiling serde v1.0.197
         Compiling autocfg v1.2.0
         Compiling syn v1.0.109
         Compiling lazy_static v1.4.0
         Compiling libc v0.2.153
         Compiling equivalent v1.0.1
         Compiling hashbrown v0.14.3
         Compiling memchr v2.7.2
         Compiling yansi v1.0.1
         Compiling unicode-width v0.1.11
         Compiling regex-syntax v0.8.3
         Compiling encode_unicode v0.3.6
         Compiling cfg-if v1.0.0
         Compiling winnow v0.6.5
         Compiling cc v1.0.92
      error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors
      warning: build failed, waiting for other jobs to finish...
      error: could not compile `serde` (build script) due to 2 previous errors
      error: could not compile `proc-macro2` (build script) due to 2 previous errors
      error: could not compile `syn` (build script) due to 2 previous errors
      error: could not compile `libc` (build script) due to 2 previous errors
      error: could not compile `winapi` (build script) due to 2 previous errors
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --`
      📦 Including license file "LICENSE"
      🔗 Found bin bindings
      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for lintrunner
Failed to build lintrunner
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567
Approved by: https://github.com/malfet
2024-09-13 18:43:55 +00:00
b6d6aa49b8 Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)"
This reverts commit e157ce3ebbb3f30d008c15914e82eb74217562f0.

Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))
2024-09-13 18:06:56 +00:00
deee21cb78 Revert "[Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)"
This reverts commit 16b37b309f64ddd4e498c57a99191e1d9b3dfdac.

Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))
2024-09-13 17:53:21 +00:00
3f69410976 [gpu-profiler] Expose active and repeat in os env var (#135757)
Summary: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/1855136444971825/

Test Plan:
`buck2 test mode/opt caffe2/test:profiler -- -r test_kineto_profiler_api `

eyes

Differential Revision: D62529249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135757
Approved by: https://github.com/Yuzhen11
2024-09-13 17:48:27 +00:00
18f9331e5d Revert "[aoti] Fix workspace generation for triton (#135552)"
This reverts commit d3833253928f29ed760b2dccac2b730028a868ca.

Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))
2024-09-13 17:47:36 +00:00
bc0f330169 [trymerge] Manually close merged PR when Github fails (#135890)
Manually close merged PR when Github fails to do it.

Consequences of current design:
Sleeping for 1 min ties up the machine, might result in race conditions, causes the merge label to be removed a bit later, and the PR is still left open if this API fails too (i.e., there is no async clean-up job)

Tested in https://github.com/malfet/deleteme/pull/92 by removing the part of the commit message that has "resolved #pr num"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135890
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-13 17:29:24 +00:00
7834c0bb2c [AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887)
Summary:
As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well.

The inductor python wrapper code level printing would look something like this:

 {F1859224287}

Test Plan: CI

Reviewed By: chenyang78

Differential Revision: D62415575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887
Approved by: https://github.com/chenyang78
2024-09-13 17:19:25 +00:00
6ef49fe8f1 Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)"
This reverts commit 3d2431380999252d5401f83d5010b398a32e7597.

Reverted https://github.com/pytorch/pytorch/pull/135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](https://github.com/pytorch/pytorch/pull/135058#issuecomment-2349480861))
2024-09-13 17:09:45 +00:00
a15774563b [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-09-13 16:45:39 +00:00
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
ae02d663cd [FlexAttention] Fix output layout (#135882)
We previously only supported the same v_head dim and qk_head dim. When we allowed different head dims, I accidentally kept the same query strides for the output. This PR fixes that bug and also ensures that we always produce output in the same stride order as the input query.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882
Approved by: https://github.com/yanboliang, https://github.com/Chillee
2024-09-13 16:36:05 +00:00
ad2f0e9f81 Add remote cache time saved to compilation metrics (#135490)
Summary:
Record remote cache time saved via frame_phase_timing

We add to the "phase" when a remote cache hit saves us time, so that we have a 1:1 correspondence between a frame and time saved.

Test Plan:
Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized.

Show that column exists in table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff

Note that an earlier version of D62105258 had the column as a string, so the staging table is a bit messed up. But you can see the most recent samples have the column populated as a float.

Reviewed By: aorenste

Differential Revision: D62106921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490
Approved by: https://github.com/aorenste
2024-09-13 16:35:51 +00:00
21ffa18ad1 Fix "expand: SymIntArrayRef expected to contain only concrete integers" in AOTInductor (#135933)
Internal xref:
https://fb.workplace.com/groups/1075192433118967/permalink/1501860707118802/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135933
Approved by: https://github.com/angelayi
2024-09-13 15:23:42 +00:00
eqy
2519e5a8de [CUDA][FP8] Skip rowwise scaling test on sm89 (#135718)
Same reason as https://github.com/pytorch/pytorch/pull/133612; the rowwise scaling implementation is sm90+ specific (e.g., it uses TMA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135718
Approved by: https://github.com/Skylion007
2024-09-13 15:07:20 +00:00
ba6e0f31ab Remove cycle dependency by localizing the import. (#135926)
Summary:
Since https://www.internalfb.com/diff/D62215095 landed there have been many silent errors due to the dependency between functional_tensor and config.

```
 File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module>
```

https://fburl.com/logarithm/ol5kx0ee
complaining about a cycle dependency

This fixes it.

Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages

Reviewed By: aorenste

Differential Revision: D62616765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007
2024-09-13 15:05:41 +00:00
7ed0563cad Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit e504fb70693d4a3741c3380b6a989d441e84f737.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
eb7dd91dd1 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit fafdd588f27e1d56090c6d260d0382c255eaf9eb.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
3f30360d05 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 30b007bea329f512af3dc4fd4e6c7d145e807b71.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
4734e356d6 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 0c080cb2c78a85a5320fbeadbbb9a2cc640fd89d.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
ac169795a9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 2af3b8ffd84e36b91279174e9106f84b2d2a11f2.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
fca58bfda1 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 7d5e0dd4b1a8d20fc8624b3085a6f5ddedd89a2e.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
dc71e7a7d4 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit c56728b643e2b7d796abd7ec45803319e1c5967d.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
1cdf658f4a Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)"
This reverts commit eb0fe029337b31bcb3d4b2d1e539895393975d68.

Reverted https://github.com/pytorch/pytorch/pull/135167 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097957154 ([comment](https://github.com/pytorch/pytorch/pull/135167#issuecomment-2348847595))
2024-09-13 12:35:05 +00:00
b5c52e96e8 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit bf68e16e94fc05f10d434cdc162a14d02c6ad23c.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI: eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097956613 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2348837553))
2024-09-13 12:29:03 +00:00
ea2ecab15b [AOTI][reland] Fix assert_function call in cpu autotune template (#135920)
Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Test Plan: CI

Differential Revision: D62500592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920
Approved by: https://github.com/chenyang78
2024-09-13 12:21:57 +00:00
2f53d570fe Update document for autocast on CPU (#135299)
Update document for autocast on CPU due to the support of float16 and changes in the operator list.
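For reference, a minimal usage sketch reflecting the newly documented float16 support (assuming the standard `torch.autocast` API; not code from this PR):

```python
import torch

with torch.autocast(device_type="cpu", dtype=torch.float16):
    a = torch.randn(8, 8)
    b = torch.randn(8, 8)
    c = a @ b  # runs in the lower-precision dtype for ops on the autocast list
print(c.dtype)
```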

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars
2024-09-13 09:11:47 +00:00
31007cf200 [Distributed] add FP8 support to NaN checker (#135891)
Adding support for `torch.float8_e4m3fn` and `torch.float8_e5m2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135891
Approved by: https://github.com/wconstab
2024-09-13 08:43:54 +00:00
c56728b643 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-13 08:41:32 +00:00
7d5e0dd4b1 [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-13 08:41:32 +00:00
2af3b8ffd8 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
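For context, here is a minimal sketch of a torch function mode with this default push/pop enter/exit behavior (using the public `torch.overrides.TorchFunctionMode` API; not code from this PR):

```python
import torch
from torch.overrides import TorchFunctionMode

class LoggingMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {func.__name__}")
        return func(*args, **kwargs)

with LoggingMode():      # default __enter__ pushes the mode onto the mode stack
    torch.ones(2) + 1    # ops dispatch through __torch_function__
# default __exit__ pops the mode again
```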

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added a torch function mode polyfill which no-op enters but exits normally. This is needed because we still want to trace the with context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is, so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true Python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side-effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-13 08:41:24 +00:00
0c080cb2c7 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph, which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-13 08:41:17 +00:00
30b007bea3 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.
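For illustration, a small sketch (assuming standard CPython `threading.local` behavior; not code from this PR) of the kind of slot-based object involved:

```python
import threading

tls = threading.local()   # slot-based C object, as referenced above
tls.device = "cuda"       # the type's own setattr implementation works fine
# Per the note above, object.__setattr__(tls, "device", "cpu") would raise a
# TypeError on these objects, which is why dynamo replays mutations via the
# type's own setattr instead.
print(tls.device)
```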

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-13 08:41:07 +00:00
fafdd588f2 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-13 08:41:00 +00:00
e504fb7069 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-13 08:40:50 +00:00
b346e99376 remove fast_flush arguments (#135387)
I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value.

Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-09-13 08:13:46 +00:00
7dc1788396 [inductor] Remove the batch fusion passes from being a default (#135922)
The Ads team does a search internally to figure out which fusion passes to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135922
Approved by: https://github.com/eellison, https://github.com/yanboliang
ghstack dependencies: #135819
2024-09-13 06:07:33 +00:00
9fd54d787d [Inductor UT] Generalize device-bias code in test_triton_kernels.py introduced in #135530 (#135656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135656
Approved by: https://github.com/EikanWang, https://github.com/zou3519
2024-09-13 05:27:56 +00:00
b38be727eb [Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_torchinductor_opinfo.py`
Reuse `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134556
Approved by: https://github.com/etaf, https://github.com/eellison
2024-09-13 05:16:28 +00:00
e54b559e88 [inductor] More fixes on the keys of constants and signature dictionaries (#135406)
The previous PR forgot to change two other places that also create `constants` and `signature`. https://github.com/pytorch/pytorch/pull/135170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406
Approved by: https://github.com/jansel
2024-09-13 04:10:41 +00:00
eea5e6ff0f [DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model (#135763)
Fix https://github.com/pytorch/pytorch/issues/134095

This is a workaround for loading full state dict into a FSDP1+TP 2D model.
Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2d model. In order to load a full state dict in FSDP1+TP 2D model, we need to do:
- load the full state dict into a 1D FSDP model
- dcp.save the full/shard state dict into storage
- initialize a 2D FSDP1+TP model
- get the default sharded state dict for the 2D model (full_state_dict=False)
- dcp.load the state dict from storage
- load the state dict into the 2D model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763
Approved by: https://github.com/fegin
ghstack dependencies: #135725
2024-09-13 03:51:14 +00:00
6df91b5917 real tensor prop for composite ops (#135717)
Fixes #135632

Adds real tensor propagation for decompositions, checking any symbols on their outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135717
Approved by: https://github.com/ezyang
2024-09-13 03:35:16 +00:00
0cdc6a8dcd [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
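A hedged sketch of the difference (import paths assumed to be the public DTensor API; run under torchrun with a process group initialized and a world size that divides the dim-0 size):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard, distribute_tensor

mesh = init_device_mesh("cpu", (dist.get_world_size(),))
full = torch.arange(32, dtype=torch.float32).reshape(8, 4)  # identical on every rank

# distribute_tensor scatters from the source rank -> involves a collective.
dt_scatter = distribute_tensor(full, mesh, [Shard(0)])

# DTensor.from_local wraps each rank's local slice directly -> no collective,
# which is also what lets it represent strided-sharding layouts.
local = full.chunk(dist.get_world_size(), dim=0)[dist.get_rank()]
dt_local = DTensor.from_local(local, mesh, [Shard(0)])
```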
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-13 03:26:36 +00:00
6cdc70bccd [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135917
Approved by: https://github.com/malfet
2024-09-13 02:46:48 +00:00
e6b68359d7 Fix xpu memory stats error (#135818)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135726
After merging two free blocks, I mistakenly decreased the active memory size by the merged block size instead of the original block size.

# Additional Context
Add a UT to guard this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818
Approved by: https://github.com/EikanWang
2024-09-13 02:41:21 +00:00
1c04cbfba6 [BE] Use C10_UNUSED (#135914)
Instead of `(void)foo; // Suppress unused variable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135914
Approved by: https://github.com/huydhn, https://github.com/eqy
2024-09-13 02:27:07 +00:00
062681a0ed [Profiler] Torch Profiler distributed info is not JSON serializable (#135548)
Summary: To fix https://github.com/pytorch/pytorch/issues/133308 we must create an encoder for numpy values so we can serialize the distributed metadata to JSON.

Test Plan: Added unit test to check that numpy values can be serialized

Differential Revision: D62411619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135548
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD
2024-09-13 02:22:33 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
bf68e16e94 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-13 01:14:18 +00:00
eqy
d732df7e56 [Inductor] Disable TF32 in test_slice_scatter_reinplace (#135709)
TF32 linear/matmul numerics seem unrelated to the test functionality, so disable TF32 here to abate noisy failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709
Approved by: https://github.com/eellison
2024-09-13 00:30:45 +00:00
c9de2efde6 [Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894)
Addresses https://github.com/pytorch/pytorch/issues/135880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-09-13 00:19:42 +00:00
1f15c0c7a5 [fx] Replace _snake_case with a regexp (#135822)
~2x speedup on this function, though saves <0.5s overall
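A hypothetical illustration of the regexp approach (the actual pattern used in torch.fx may differ):

```python
import re

# One compiled zero-width pattern instead of per-character logic: insert "_"
# between a lowercase letter/digit and the following uppercase letter.
_CAMEL = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")

def snake_case(name: str) -> str:
    return _CAMEL.sub("_", name).lower()

print(snake_case("LayerNorm"))  # layer_norm
```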

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820, #135821
2024-09-13 00:18:41 +00:00
a72124add9 [fx] Minor optimization in create_arg (#135821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820
2024-09-13 00:18:41 +00:00
10ca4c0564 [inductor] Use TracerBase directly in LoopBody (#135820)
This skips some unneeded work in the subclass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788
2024-09-13 00:18:41 +00:00
d3aab9642b [inductor] Optimize can_fuse_vertical() (#135788)
An O(n^2) to O(n) improvement by not comparing all pairs of deps.

Before:
![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9)

After:
![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788
Approved by: https://github.com/oulgen
ghstack dependencies: #135787
2024-09-13 00:18:41 +00:00
67a929eea8 [inductor] Remove unused check (#135787)
I think this is unreachable code because mode is always None on reads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787
Approved by: https://github.com/oulgen
2024-09-13 00:18:41 +00:00
f576960bbc do not expand in replace/simplify if no changes (#135863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863
Approved by: https://github.com/ezyang
2024-09-13 00:12:01 +00:00
1aba224cfd Update nightly PyTorch version to 2.6.0 (#135916)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916
Approved by: https://github.com/kit1980
2024-09-13 00:08:52 +00:00
d383325392 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like the below, so we also implement a `zero_()` for `AtenTensorHandle`.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handling of grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-12 23:53:09 +00:00
00dc7d4356 fix compiled_autograd deadlock throw (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-12 23:24:57 +00:00
1760bbc259 [FlexAttention] Ensure q/k/v and block_mask on excact the same device (#135823)
Fixes #134739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823
Approved by: https://github.com/BoyuanFeng
2024-09-12 23:11:01 +00:00
fb9d8e3248 [ROCm] Use ieee precision for fp32 in flex attention (#135702)
3bebc09be9

Brought in a change to flex_attention to allow TF32 precision, this largely lacks support on ROCm side and we should use ieee.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-09-12 23:00:48 +00:00
aaabfc8930 [Easy] Check if quant registered in constant folding (#135875)
Belated fix for https://github.com/pytorch/pytorch/issues/110904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875
Approved by: https://github.com/shunting314
2024-09-12 22:16:39 +00:00
63d6cd351a [dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404)
Fixes https://github.com/pytorch/pytorch/issues/134608
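A usage sketch of what this enables (API names from `torch.nn.attention`; the compiled region is assumed to no longer graph-break on the context manager):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

@torch.compile
def attn(q, k, v):
    with sdpa_kernel(SDPBackend.MATH):  # now traceable instead of graph-breaking
        return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 2, 8, 16)
out = attn(q, k, v)
```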

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-09-12 22:04:48 +00:00
3de9e474df Revert "Check function declarations of Core ML code (#135467)"
This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63.

Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))
2024-09-12 22:04:35 +00:00
3e1a4ea132 Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)"
This reverts commit 83c594ebd6dfa517fdd67ae23929cc60d5fa325d.

Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](83c594ebd6) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))
2024-09-12 21:47:38 +00:00
e157ce3ebb Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Adding validation checks to check the input types and display better error messages for the same.
Fixes https://github.com/pytorch/pytorch/issues/135463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596
Approved by: https://github.com/malfet
2024-09-12 21:28:37 +00:00
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accomodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there's 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
3d24313809 Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)
Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897

This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute ](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel

Example:
```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

    model(input1)   # this goes to ACL lowp_gemm
    print("="*50)
    model(input2)   # this goes to gemm:jit without this PR, and to ACL with this PR
```
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, **With this PR** the matmul goes to oneDNN+ACL which is around 10x faster than oneDNN+jit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058
Approved by: https://github.com/jondea, https://github.com/malfet
2024-09-12 20:30:20 +00:00
cd472bb1e3 [torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553)
Summary:
Sometimes we only want to generate a replacement for a matched pattern
once we know some information about the nodes in the pattern.

So far, we have found this most useful for doing matches based on specific
shapes of tensors flowing into functions.
Use a callback function similar to `match_filters`. By default this isn't used.

Had to make `replacement` a None-able parameter because Callable was
already used to detect a case where a graph needed to be traced.

Differential Revision: D62412628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553
Approved by: https://github.com/SherlockNoMad
2024-09-12 18:52:14 +00:00
f032135bbf Add batching rule for torch.scatter_reduce (#135547)
Fixes #134797
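Illustrative usage of what the batching rule enables (a sketch, not from the PR):

```python
import torch

def per_row(src, index, base):
    return base.scatter_reduce(0, index, src, reduce="sum")

src = torch.randn(3, 5)
index = torch.randint(0, 4, (3, 5))
base = torch.zeros(3, 4)
# With the batching rule in place, vmap batches over dim 0 of all three args
# instead of hitting a missing-rule error or a fallback.
out = torch.vmap(per_row)(src, index, base)
```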

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547
Approved by: https://github.com/zou3519
2024-09-12 18:51:21 +00:00
525bec804c NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values); see the usage sketch below
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~
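A usage sketch of the pre-existing conversion mentioned in the first bullet (the new reverse op intentionally has no public API yet):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)
padded = nt.to_padded_tensor(0.0)  # shape (2, 3, 4); the shorter row is zero-padded
```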

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-12 17:54:25 +00:00
83c594ebd6 [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-12 17:43:57 +00:00
c1277945d3 [AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731)
Summary:
As title.

Effect after merging this diff would look something like this:

```
        print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0)
        triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0)
        print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0)
        buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32)
        # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm]
        print('inductor: before_launch - extern_kernels.addmm - buf0', buf0)
        extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1)
        print('inductor: after_launch - extern_kernels.addmm - buf0', buf0)
```

Context: D62272588 only support major triton kernel jit inductor debug printing codegen

Test Plan: CI & OSS CI

Reviewed By: chenyang78, ColinPeppler

Differential Revision: D62397017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731
Approved by: https://github.com/ColinPeppler
2024-09-12 17:31:10 +00:00
dab7d646d5 Use a better decomposition for split_with_sizes (#135728)
This decomposition has less checks and improves the performance
of torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728
Approved by: https://github.com/ezyang
2024-09-12 16:38:51 +00:00
7647c398ff Allow optional positional arguments for torch.func.functional_call (#134643)
This PR resolves #134408. Add an additional test and have passed the local test.

Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.

This PR does not include any such post-check.
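A minimal sketch of the newly allowed call pattern (assuming the updated signature where `args` defaults to None):

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class NoInput(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(3))

    def forward(self):              # a module that takes no positional inputs
        return self.w.sum()

m = NoInput()
params = {"w": torch.ones(3)}
out = functional_call(m, params)    # args can now be omitted entirely
```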

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
2024-09-12 15:22:06 +00:00
d67cc58181 [ONNX] Fix symbolic values and numpy implementation (#135786)
1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that
2. Update the `__array__` method so that it works for tensor on GPU
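For point 1, the underlying Python rule (plain Python, not ONNX-specific): defining `__eq__` without `__hash__` makes instances unhashable, so removing the custom `__eq__` restores the default hash.

```python
class WithEq:
    def __eq__(self, other):  # defining __eq__ implicitly sets __hash__ = None
        return NotImplemented

class PlainClass:
    pass

print(PlainClass.__hash__ is object.__hash__)  # True
print(WithEq.__hash__)                         # None -> hash(WithEq()) raises TypeError
```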

Fixes https://github.com/pytorch/pytorch/issues/135700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786
Approved by: https://github.com/titaiwangms
2024-09-12 14:24:43 +00:00
dddaadac6c [dynamo] Dont graph break on inner torch.compile (#135819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819
Approved by: https://github.com/jansel
2024-09-12 11:39:09 +00:00
02169364e1 [inductor] Split reduction loops when there is no shared reads (#134307)
Fixes #129102

![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307
Approved by: https://github.com/shunting314
2024-09-12 09:45:08 +00:00
c30042fbeb [GPT-fast] Update compilation time target for Llama & Mixtral (#135817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817
Approved by: https://github.com/xmfan, https://github.com/huydhn
2024-09-12 07:13:44 +00:00
6700175531 [Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574)
This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-12 06:56:34 +00:00
de8a8653c0 [dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554)
**Summary**
1. This PR removes the public API `compute_local_shape` and replace its use with the more general API `compute_local_shape_and_global_offset`.
2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards which is more aligned with DTensor's semantics on non-participating ranks.

**Test**
`pytest test/distributed/_tensor/test_dtensor.py`
`pytest test/distributed/_tensor/test_init.py`
`pytest test/distributed/_tensor/test_tensor_ops.py`

Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-09-12 06:30:09 +00:00
86335e9135 [reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735
Approved by: https://github.com/oulgen
2024-09-12 05:50:39 +00:00
14e3f3c062 [aoti] Remove nlohmann/json.hpp from header (#135765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765
Approved by: https://github.com/malfet
2024-09-12 05:38:51 +00:00
9852c6d236 xpu: fix 3rd party builds on systems with cmake<3.25 (#135767)
The CMake LINUX variable is only available starting from CMake 3.25. It is better to use CMAKE_SYSTEM_NAME instead to relax the CMake version requirement.

See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html
Fixes: #135766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767
Approved by: https://github.com/malfet, https://github.com/guangyey
2024-09-12 05:31:01 +00:00
6354271178 [inductor] Skip unused call to get_estimated_runtime() (#135776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776
Approved by: https://github.com/oulgen
ghstack dependencies: #135445, #135446
2024-09-12 05:22:23 +00:00
12902f6ecf [inductor] Cache get_operation_names/get_buffer_names (#135446)
Before:
![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311)

After:
![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446
Approved by: https://github.com/oulgen
ghstack dependencies: #135445
2024-09-12 05:22:23 +00:00
3decb676aa [inductor] Optimize cache_on_self (#135445)
This is a small compile time win, but also makes profiles more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445
Approved by: https://github.com/oulgen
2024-09-12 05:22:23 +00:00
8d68a02905 OpenReg: Split the daemon into driver/executor (#135646)
Split the daemon into a proper user-process driver vs device-process executor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646
Approved by: https://github.com/albanD
2024-09-12 05:03:46 +00:00
28330a8a39 [reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733
Approved by: https://github.com/oulgen
2024-09-12 04:29:37 +00:00
eaba287adb [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
2024-09-12 04:05:08 +00:00
cyy
f5f1d0a753 Fix build warnings for torch_python (#134981)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981
Approved by: https://github.com/ezyang
2024-09-12 03:59:34 +00:00
5bc238c73e torch.hub: add get_dir/set_dir type hints (#134906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906
Approved by: https://github.com/Skylion007
2024-09-12 03:53:29 +00:00
79223114db Avoid inserting extra transpose when the input to group norm is NHWC (#135575)
When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575
Approved by: https://github.com/ezyang
2024-09-12 03:36:05 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
0d1d69fd25 Update torch-xpu-ops pin (ATen XPU implementation) (#135647)
Release cycle for PyTorch 2.5
1. Fix a runtime error on Windows: failure to load torch_xpu_ops_unary_binary_kernels.dll because the binary size is too large.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647
Approved by: https://github.com/EikanWang
2024-09-12 03:16:08 +00:00
21a64d57b1 [BE] typing for decorators - masked/_ops (#135108)
Differential Revision: D62184735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108
Approved by: https://github.com/Skylion007
2024-09-12 01:34:09 +00:00
1a74952925 "Remove BLOCK_LIST" (#135729)
Summary:
Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training IR, since it only targets the bn-getitem pattern, which doesn't exist in training IR.

Remove BLOCK_LIST since it's empty.
Now all internal unittests will use training ir.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
```

Differential Revision: D62387987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729
Approved by: https://github.com/tugsbayasgalan
2024-09-12 01:22:06 +00:00
a130ed828a Fix the upload of x86 micro benchmark results (#135780)
The upload stats workflow currently skips this https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639; this is a miss from https://github.com/pytorch/pytorch/pull/135042. So the workflow is running, but nothing has been uploaded yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135780
Approved by: https://github.com/atalman
2024-09-12 01:16:38 +00:00
eb0fe02933 [PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)
Summary:
We observed another long-computation-kernel issue for the OBA_AFOC pyper model, so we add a pattern to avoid the perf regression.

- Only happens in A100
- We do not want to use force_shape_pad since it will pad all GEMMs, which may not be optimal. The Optimus pass has more flexibility to customize the GEMM shape and do the corresponding padding
- To enable, we pass the pass to config, where "k_threshold_to_pad" can be customized

inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad" : 8388608}})

Test Plan:
# unit test

```
buck2 test mode/opt //caffe2/test/inductor:pad_mm
```
Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651
Network: Up: 9.0KiB  Down: 142B  (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e)
Jobs completed: 9. Time elapsed: 3:18.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e test
see [D62388582](https://www.internalfb.com/diff/D62388582)

Differential Revision: D62220158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167
Approved by: https://github.com/jackiexu1992
2024-09-12 00:51:34 +00:00
d270e2d240 [FSDP2] better error msg for cpu offloading (#135156)
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 will throw a less obvious error at backward time:
```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly, specifying which parameters should be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156
Approved by: https://github.com/awgu
2024-09-12 00:05:07 +00:00
16b37b309f [Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313
Approved by: https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #135312
2024-09-11 23:59:54 +00:00
13ee85ca5e [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312)
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison
2024-09-11 23:59:54 +00:00
94d2471d1f [Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730)
Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics (i.e., if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_'ed into; whereas if we just swap out the unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).
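A minimal eager-semantics illustration (plain tensors, not FSDP code) of why `copy_` keeps user-held aliases in sync while a storage swap via `set_` does not:

```python
import torch

p = torch.zeros(4)
alias = p.view(-1)                  # user-held alias of the parameter

p.copy_(torch.ones(4))              # in-place write: the alias sees the update
assert torch.equal(alias, torch.ones(4))

q = torch.zeros(4)
alias_q = q.view(-1)
q.set_(torch.full((4,), 2.0))       # swaps q's storage: alias_q still sees the old data
assert torch.equal(alias_q, torch.zeros(4))
```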

This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
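To illustrate the aliasing point with plain tensor ops (a toy analogy, not the FSDP2 code itself): an in-place `copy_` is visible through a previously saved alias, while swapping storage with `set_` is not.

```
import torch

param = torch.zeros(4)
alias = param.view(-1)      # user-saved alias of the "unsharded param"

# copy_ writes into the existing storage, so the alias observes the update.
param.copy_(torch.ones(4))
print(alias)                # tensor([1., 1., 1., 1.])

param2 = torch.zeros(4)
alias2 = param2.view(-1)

# set_ swaps the underlying storage, so the previously saved alias goes stale.
param2.set_(torch.ones(4))
print(alias2)               # tensor([0., 0., 0., 0.])
```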

------

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
2024-09-11 23:01:05 +00:00
5ca46be15e Fix/torch cat doc attr (#135698)
The `torch.cat` attr name for tensors in the docs differs from the method signature, unlike other methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135698
Approved by: https://github.com/albanD

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
2024-09-11 22:32:55 +00:00
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and it happens in real-world computation.

I've tried to use the `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good ones in FP32. I figured out this happens due to overflow while computing the square of the input tensor.

The original `LlamaRMSNorm` implementation upcasts the input to FP32 to prevent this and give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.
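A quick illustration of the overflow (the FP16 max is ~65504, so squaring moderately large activations already produces inf, while the upcast path stays finite):

```
import torch

x = torch.full((8,), 300.0, dtype=torch.float16)

# Squaring directly in fp16 overflows: 300**2 = 90000 > 65504.
print(x.pow(2))   # tensor([inf, inf, ...], dtype=torch.float16)

# The LlamaRMSNorm-style upcast keeps the intermediate in fp32 range.
variance = x.float().pow(2).mean(-1, keepdim=True)
print((x.float() * torch.rsqrt(variance + 1e-6)).to(torch.float16))   # ~1.0 everywhere
```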

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
66db61f0d1 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-11 21:29:04 +00:00
c025f7becc Revert "[Partitioner] Reuse partition to check whether nodes exist (#135317)"
This reverts commit e004d539da3335d97a8134c9081245628f18eb67.

Reverted https://github.com/pytorch/pytorch/pull/135317 on behalf of https://github.com/izaitsevfb due to BC-breaking, breaks executorch and internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/135317#issuecomment-2344730294))
2024-09-11 21:27:53 +00:00
8c4e1148b8 Refactoring byte_order (#135558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558
Approved by: https://github.com/mikaylagawarecki
2024-09-11 21:06:43 +00:00
e20ee39558 Expand bitwise ops to unsigned types (#135525)
Fixes https://github.com/pytorch/pytorch/issues/135436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135525
Approved by: https://github.com/ezyang
2024-09-11 20:48:52 +00:00
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. Users should enable Navi31 support with the env var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Known problems:
1. `test/test_transformers.py` will have massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` fixes the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support for nested tensors + SDPA but needs more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
5d964a5eb7 [Export] Fix SDPA decomposition (#135297)
Summary: Update the SDPA decomposition to match the updated strides from D62009189, which aligns strides with `aten._scaled_dot_product_attention_math.default` and makes `t.permute().contiguous().permute()` no longer necessary.

Test Plan: CI

Differential Revision: D62278378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297
Approved by: https://github.com/drisspg
2024-09-11 20:21:59 +00:00
118d7e1480 [Inductor] add _dynamo.reset to test_cat_slice_cat_cuda (#135694)
Summary: test_cat_slice_cat_cuda runs Inductor multiple times and checks counters["inductor"] in between, and thus we need to reset properly.

Differential Revision: D62500331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135694
Approved by: https://github.com/masnesral
2024-09-11 20:07:11 +00:00
dd47f6f623 Simplify expr before getting implications in _maybe_evaluate_static (#135499)
Fixes #134268

Previously we weren't simplifying these expressions before calling get_implications, resulting in inconsistent application of FloorDiv/CleanDiv. See #134268  for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135499
Approved by: https://github.com/ezyang
2024-09-11 19:48:29 +00:00
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
ad75b09d89 Replace capture_pre_autograd_graph with export_for_training in torch tests (#135623)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```

CI

Differential Revision: D62448302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
2024-09-11 19:23:08 +00:00
a2cb9b7331 Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135581
Approved by: https://github.com/eellison
ghstack dependencies: #135530
2024-09-11 18:43:18 +00:00
451eaf0ff2 Log full exception trace when error raised in Dynamo (#135697)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135697
Approved by: https://github.com/Skylion007
2024-09-11 18:14:33 +00:00
09519eb195 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-11 18:01:26 +00:00
5314ae2660 Don't use exception chaining for BackendCompilerFailed (#135545)
Commandeered from https://github.com/pytorch/pytorch/pull/135496 as I'm now helping @ezyang ship dynamic float arguments in PT2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135545
Approved by: https://github.com/ezyang
2024-09-11 17:49:18 +00:00
da587de9cb [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters v2 (#133852)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
`if torch.version.hip is not None:`

Which was incorrectly replaced by:
`if self.device_props.type != "hip":`

Another occurrence of https://github.com/pytorch/pytorch/pull/130617
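Presumably the fix restores the original intent, i.e. the ROCm-specific tuning branch should only be taken on HIP (a sketch; the surrounding code is not shown in this message):

```
# Equivalent of the original `torch.version.hip is not None` check:
if self.device_props.type == "hip":
    ...  # apply the ROCm-specific tuning parameters
```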

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133852
Approved by: https://github.com/masnesral, https://github.com/malfet
2024-09-11 17:21:40 +00:00
82a4df2d5f [CI] [ROCm] Run rocm workflow on every push to main branch (#135644)
Dial the frequency back up from https://github.com/pytorch/pytorch/pull/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135644
Approved by: https://github.com/huydhn
2024-09-11 17:21:05 +00:00
18a9030952 [CI] Fix update slow tests (#135390)
* Add pytorchbot to list of approvers for file
* Add labels to the auto created PR

The auto-generated PR is currently not merging due to some failing tests on the slow workflow that were supposed to be moved back to normal.

I don't know if this has much value; clearly we've been managing without the update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390
Approved by: https://github.com/ZainRizvi
2024-09-11 17:02:17 +00:00
03f23d07b4 Optimize ShapeEnv.replace (#135652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135652
Approved by: https://github.com/ezyang
ghstack dependencies: #135621, #135622
2024-09-11 16:50:59 +00:00
8c738c9270 Improve performance of sympy_generic_le (#135622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135622
Approved by: https://github.com/ezyang
ghstack dependencies: #135621
2024-09-11 16:20:03 +00:00
7ddacaf40a Improve performance of canonicalize_bool_expr (#135621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135621
Approved by: https://github.com/ezyang
2024-09-11 16:20:03 +00:00
183c32fd3b Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 0d15122092c27fec1143b800bab7c996d126b547.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/133137#issuecomment-2344054339))
2024-09-11 15:57:00 +00:00
3ab12e2596 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 160c228a4bd60ceffa62b045a6b0a6f9413835c5.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135443#issuecomment-2344042800))
2024-09-11 15:53:55 +00:00
596e93b506 Revert "[dynamo] Bug fix for _torchdynamo_inline source handling (#135612)"
This reverts commit 5c3d0a2dedbc0e85f3b256ce56ac674078a5fae1.

Reverted https://github.com/pytorch/pytorch/pull/135612 on behalf of https://github.com/clee2000 due to broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_linear_input_transpose_bias_True_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10805518363/job/29982386304) [HUD commit link](5c3d0a2ded), bad TD ([comment](https://github.com/pytorch/pytorch/pull/135612#issuecomment-2344039370))
2024-09-11 15:51:12 +00:00
f96e8041b1 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 444b52ff40cf4afce7bc3fdcf021a88eab3b954c.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135444#issuecomment-2344036843))
2024-09-11 15:48:27 +00:00
7cf9c81918 Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 6a3edfcc1e474e6ebd0c06624000a6d6bf1a0dee.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/clee2000 due to broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2344016694))
2024-09-11 15:39:21 +00:00
49e0b88aab Fix test_triton_kernel_float64_constant (#135583)
Summary: Landed https://github.com/pytorch/pytorch/pull/135260 too soon, and the test in that PR doesn't do exactly what I tested (it should actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135583
Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007
2024-09-11 15:16:23 +00:00
ee8c5cc1cc For S444023: Back out "deprecate search_autotune_cache (#133628)" (#135186)
Summary: For S444023

Test Plan:
Revert prevented the NaN errors - f639391901
Training job ran for 7767 iterations. NaN errors show up within the first 1k.

Reviewed By: nmacchioni

Differential Revision: D62224747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
2024-09-11 14:08:40 +00:00
ce4d146f56 ATen | Fix MPSCNNNeuron creation on Mac Catalyst. (#135595)
Summary:
These are still used directly when relu/sigmoid/tanh tensors are used directly from here: https://fburl.com/code/k6n7ofzd
However, on Mac Catalyst we were always returning `nil`, which in most cases rendered the entire graph useless and most often left just stray `MPSTemporaryImage` references that were never written into.

This fixes the issue completely by making sure that we always return the valid kernels back, so they can be executed.

Test Plan: Test with segmentation net that uses a combination of relu and other tensors together - run this via Mac Catalyst build - it works! {F1858576745}

Reviewed By: MichaelTay

Differential Revision: D62430010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595
Approved by: https://github.com/MichaelTay
2024-09-11 11:12:23 +00:00
0226fcaacf Disable cuda specific restrictions in _scaled_mm for other devices (#135579)
Fixes #135576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135579
Approved by: https://github.com/drisspg
2024-09-11 11:05:38 +00:00
4cde5096c4 [Inductor][FlexAttention] Supports dynamic shapes with block mask (#135629)
Fixes #134560 and #135206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135629
Approved by: https://github.com/drisspg
2024-09-11 08:10:50 +00:00
443c015393 [Distributed] Improve efficiency of NaN checker (#135414)
Some customers would like to run the NaN checks on the fly, so we are improving its efficiency.

## Benchmarking
Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1`
Red kernel: ncclAllreduce
Blue kernel: Nan check

<img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3">

## Comparison with torch ops:
Let's say a user manually checks for NaNs with the following torch ops before the all-reduce:
```
torch.any(torch.isnan(x))
```
<img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b">

So our perf is on-par with torch ops.

## Changes
- Load from vidmem using "big packs" of 16 bytes
- Bump `blockDim.x` from 256 to 512
- Separate loads and checks into two loops, each of 8 iterations
- Unroll the loops
- Templated functions for checking NaN in a "big pack" based on dtype

Special thanks to @jbachan from NCCL!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414
Approved by: https://github.com/wconstab
2024-09-11 07:53:42 +00:00
4ae6d7c18f Back out "[pytorch][PR] [export] fix re-export custom metadata" (#135634)
Summary: Broke some tests. Revert this diff

Test Plan: CI

Differential Revision: D62474337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135634
Approved by: https://github.com/tugsbayasgalan
2024-09-11 06:16:26 +00:00
3084b7b5c0 [cuDNN][SDPA] Support attn_bias in cuDNN (#130482)
CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-11 05:59:25 +00:00
5c3d0a2ded [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
ghstack dependencies: #135588
2024-09-11 05:23:42 +00:00
c608b17f60 [PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496)
While designing something else that needed TCPStore, I spent some time digging into the TCPStore codebase and found that the code is a little challenging to understand without proper documentation. Although people from the OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road.

Also, for libuv, we need to prefix private variables with a "_", so this includes a pure renaming of private variables such as `tcpServer`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496
Approved by: https://github.com/wconstab
2024-09-11 04:42:25 +00:00
444b52ff40 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored-modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-11 04:18:22 +00:00
160c228a4b [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread-local objects. These objects have a slots-based setattr impl which doesn't appear to have any side effects, so we call that impl when replaying mutations; calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-11 04:18:22 +00:00
0d15122092 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode or tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do for other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-11 04:18:22 +00:00
6a3edfcc1e [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with the metadata torch function mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so they cannot be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-11 04:18:22 +00:00
356f14e7b7 Fix the output of FileCheck when not run and add unit tests (#135345)
When FileCheck is destructed without execution, it should output all rules.
For example:
```
>>> fc = FileCheck().check("test")
>>> del fc
You have not run this instance of FileCheck!
FileCheck checks:
        CHECK: test
```

Additionally, unit tests for the Python interface of FileCheck will be added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345
Approved by: https://github.com/eellison
2024-09-11 04:13:24 +00:00
34dc8f69a1 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their own out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a Python package exposing entry points. For example, the structure of the redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of the `pyproject.toml` should indicate that this package exposes a torchrun entry point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler

def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        # Imported lazily so the backend module is only loaded when the plugin is used.
        from redis.redis_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler
```

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
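For reference, a rough skeleton of what `redis_backend.py` would need to provide, based on the `RendezvousBackend` interface linked above (method signatures paraphrased here; check the linked source for the exact types):

```
from typing import Optional, Tuple

from torch.distributed.elastic.rendezvous.dynamic_rendezvous import (
    RendezvousBackend,
    Token,
)

class RedisRendezvousBackend(RendezvousBackend):
    """Stores the rendezvous state as a single value in Redis (sketch only)."""

    def __init__(self, client, key: str) -> None:
        self._client = client
        self._key = key

    @property
    def name(self) -> str:
        return "redis"

    def get_state(self) -> Optional[Tuple[bytes, Token]]:
        # Return (state, fencing_token), or None if no state has been written yet.
        raise NotImplementedError

    def set_state(
        self, state: bytes, token: Optional[Token] = None
    ) -> Optional[Tuple[bytes, Token, bool]]:
        # Compare-and-set the state; return the new (state, token, has_set) triple.
        raise NotImplementedError
```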

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
2024-09-11 03:35:02 +00:00
cd9ee49a69 [aoti] Add cpp loader (#135374)
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support Windows for now; I think I need to add some more casing to make the build commands work on Windows.

Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
2024-09-11 03:00:01 +00:00
26e5572dd2 Bump triton xpu pin and release version (#135638)
Similar with https://github.com/pytorch/pytorch/pull/135627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135638
Approved by: https://github.com/atalman
2024-09-11 00:56:15 +00:00
693897df42 [dynamo] Missing guard source keys for corner case of NNModuleVariabl… (#135041)
Potentially fixes - https://fb.workplace.com/groups/1286739428954016/permalink/1319662695661689/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135041
Approved by: https://github.com/ezyang
2024-09-11 00:43:26 +00:00
3bf6be457d [MPS] Add missing dispatch to rshift.Tensor (#135607)
Missed it while working on https://github.com/pytorch/pytorch/pull/131813
Test plan: `python -c "import torch;print(torch.randint(100, 500, (64,), device='mps') >> torch.tensor([3,], device='mps'))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135607
Approved by: https://github.com/manuelcandales
2024-09-11 00:20:53 +00:00
492f064f15 [ONNX] Add assertion nodes to ignoring list (#135591)
Fixes #135419

PS: there are 104 empty output nodes, I suggest we add them one by one when we run into them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135591
Approved by: https://github.com/justinchuby
2024-09-11 00:18:17 +00:00
29408ea81a Add option to tweak inductor stride settings for user-defined triton kernels (#135530)
Previously, Inductor was allowed to modify the stride/storage_offset
(layout) for inputs to user-defined triton kernels. This can cause
silent incorrectness because most triton kernels are written for a
specific striding pattern (usually contiguous).

This PR adds a config to allow the user to choose Inductor's behavior on
this. The options are:
- "flexible_layout" (default): Inductor can modify the layout for inputs
  to user-defined triton kernels as much as it wants.
- "needs_fixed_stride_order": Inductor must preserve the stride order
  (when compared to tracing) for inputs to user-defined triton kernels.

This matches our handling for custom operators. In the future, we'll
want a "needs_exact_strides" option (this is the safest option).

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530
Approved by: https://github.com/FindHao, https://github.com/oulgen
2024-09-11 00:11:17 +00:00
02dcb07765 Add boolean support in pack segments ops for both cpu and cuda impls (#132897) (#135620)
Summary:

Same as int types, forward only.

bypass-github-export-checks diff has been synced to github

Test Plan:
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- test_pack_segments
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16888498646804437/

Reviewed By: garroud

Differential Revision: D60785563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135620
Approved by: https://github.com/kit1980

Co-authored-by: Haoming Lu <haominglu@meta.com>
2024-09-11 00:03:17 +00:00
5c38aa72c0 [dynamo][dicts][nv-embed] Support update with kwargs (#135588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135588
Approved by: https://github.com/yanboliang
2024-09-10 23:50:23 +00:00
5134ba7458 Bump triton pin and release version (#135627)
Update the pin and release version to sync with https://github.com/triton-lang/triton/tree/release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135627
Approved by: https://github.com/Chillee, https://github.com/drisspg, https://github.com/malfet
2024-09-10 23:46:36 +00:00
e48ee2cf50 [ONNX] Fix scaled_dot_product_attention with float scale (#135594)
Fixes #125158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135594
Approved by: https://github.com/justinchuby
2024-09-10 23:04:02 +00:00
eb38ee21ba [ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)
Fixes #132964

This change optimizes torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for the ROCm platform.
By increasing this parameter, it uses fewer thread blocks and improves the performance.

Test:
Tested on MI300x and H100, and now the MI300x perf improved to 3205GByte/s from ~1690GByte/s for the test case and is slightly better than H100 (3136GByte/s).

Also tested with other tensor sizes and also saw perf improvements.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')

ms = do_bench(lambda: x.sum(dim=-1))

bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)

time_s = ms / 1000

bw_per_second = bandwidth_gbyte / time_s

print(bw_per_second)
```

Co-author: @carlobertolli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397
Approved by: https://github.com/eqy, https://github.com/malfet
2024-09-10 21:03:01 +00:00
8057b72763 [ez][inductor] don't benchmark cloning if there are no mutated args (#135533)
When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate a 100 ms budget to run the kernel.
Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Say the code path is hit 100 times when the graph is large; we would then save >10s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
2024-09-10 20:54:31 +00:00
7b17918dc9 [inductor] fix a device sync issue for benchmarking fusion (#135531)
Fix https://github.com/pytorch/pytorch/issues/134768 .

When we benchmark the latency for a fused node set, we do benchmarking twice:
1. benchmark the latency of the kernel including cloning mutated args
2. benchmark the latency of cloning mutated args without running the kernel

We subtract result 2 from result 1 to get the latency of the kernel itself.

But when the tensors are not on cuda device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that, in `triton.testing.do_bench`, the `torch.cuda.synchronize` call syncs the current cuda device (which is device 0 if it's not overridden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happen to be other kernels on device 0).

The fix is to set the correct current device in our benchmarking code.
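A minimal sketch of the fix (not the actual Inductor diff): pin the current device to the tensors' device so that the `torch.cuda.synchronize()` inside `do_bench` waits on the right GPU.

```
import torch
from triton.testing import do_bench

def bench_on_correct_device(fn, *args):
    # Make the tensors' device the current device so that do_bench's internal
    # torch.cuda.synchronize() actually waits for the benchmarked kernel.
    dev = next(a.device for a in args if isinstance(a, torch.Tensor))
    with torch.cuda.device(dev):
        return do_bench(lambda: fn(*args))

x = torch.randn(2**20, device="cuda:1")   # tensors on a non-default device
print(bench_on_correct_device(torch.sum, x))
```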

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531
Approved by: https://github.com/jansel
2024-09-10 20:54:31 +00:00
66c45f3ed9 [export] fix re-export custom metadata (#135282)
Fixes #134778

When a model is exported and debug handles are added to the "custom" field of non-placeholder and non-output nodes in the graph, re-exporting it will change the metadata of placeholder nodes (the "custom" field will be added or copied to these nodes, depending on whether `ExportedProgram` or `ExportedProgram.module()` is passed to `generate_numeric_debug_handle()`).

This occurs because when we re-export the model, `placeholder` nodes are unlifted to `get_attr` nodes. These nodes remain as `get_attr` after being exported to `gm_torch_level`.  Their metadata are modified [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1347) based on `params_buffers_to_node_meta` which is collected [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1312).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135282
Approved by: https://github.com/jerryzh168, https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-09-10 20:15:02 +00:00
0a9d55d2ee Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086)"
This reverts commit 16c3b8f87cfa9cb5acee8104820baa389e7ee2bd.

Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))
2024-09-10 19:51:16 +00:00
4ca65d3323 [CI] Increase sharding for jobs that are timing out (#135582)
Increase sharding for
* slow grad check
* slow cuda tests slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test
* avx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135582
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-10 19:45:13 +00:00
c932b39739 [FSDP2] Added _set_unshard_async_op (#135523)
This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation.

If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.

Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
2024-09-10 19:28:02 +00:00
1f15973657 [AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285)
Summary:
1. Add the debug printer call one level lower for the triton kernel python wrapper codegen path
2. Add `torch.save()` for jit inductor as well
3. This also fixes the issue introduced in D61949020 (the triton kernel not printing at the python wrapper code level)

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D62272588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285
Approved by: https://github.com/chenyang78
2024-09-10 19:24:58 +00:00
fc88ba260f [amdsmi][torch] Update amdsmi API usages (#135504)
Summary: In ROCm 6.2.0 there were API name changes -- we check whether the new APIs exist and use them in this diff; see 7b2463abe0 for the changes

Test Plan: CI

Differential Revision: D62325661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504
Approved by: https://github.com/eqy, https://github.com/houseroad
2024-09-10 19:15:39 +00:00
bf8d0e3107 [inductor] Enable subprocess parallel compile internally with killswitch (#132467)
Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467
Approved by: https://github.com/eellison
2024-09-10 19:05:46 +00:00
3a1239a248 [Profiler] Harden Record Function Kwargs (#135365)
Summary:
In S445839, we had HTA break because of the "stream" parameter that was added to GPU traces. This brought up discussions regarding hardening our post-processing of said inputs so as not to break the JSON schema or downstream tools. For this reason, this diff does the following:

1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future.
2. Make sure that any boolean is lowercased when rendered as a string so that the JSON does not break when parsing it
3. Force stream parameter to be an int

Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only.

Differential Revision: D62304843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365
Approved by: https://github.com/aaronenyeshi
2024-09-10 18:44:05 +00:00
4f9f1775d8 Fix flaky TestCudaWrapper.test_randint_cuda_cuda_wrapper (#135370)
Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern.
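For reference, a sketch of the isolation pattern described above (the specific options patched in `test_combo_kernels.py` may differ; this only shows the ExitStack shape):

```
import contextlib
import unittest

from torch._inductor import config

class MyComboKernelTest(unittest.TestCase):
    def setUp(self):
        super().setUp()
        # Stack the config patches so tearDown can unwind them, instead of
        # mutating the global config for every test that runs afterwards.
        self._stack = contextlib.ExitStack()
        self._stack.enter_context(config.patch({"combo_kernels": True}))

    def tearDown(self):
        self._stack.close()
        super().tearDown()
```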

Test Plan:
```
python test/inductor/test_combo_kernels.py
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370
Approved by: https://github.com/jansel
2024-09-10 18:43:14 +00:00
5e0788befb Migrate remaining jobs to use runner determinator (#134867)
At this point all self-hosted runner jobs should be using the runner determinator to switch between LF and Meta runners. This change updates the remaining jobs that have not yet been migrated over.

Issue: https://lf-pytorch.atlassian.net/browse/PC-25

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134867
Approved by: https://github.com/ZainRizvi
2024-09-10 18:14:00 +00:00
440f8f57af Revert "[fx] Bypass custom __setattr__ in Node.__init__ (#135079)" (#135562)
This reverts commit 66da3b3b2acacb116a9b23e91b24934830eaf6b8.

#135079 breaks internal tests and needs to be reverted. Reverting with mergebot doesn't work as this PR is technically part of the stack, but, according to @jansel, it should be possible to revert it individually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135562
Approved by: https://github.com/jansel, https://github.com/seemethere
2024-09-10 18:07:11 +00:00
e004d539da [Partitioner] Reuse partition to check whether nodes exist (#135317)
Checking whether a node is in a NodeList has O(n) time complexity. Reuse the partition to speed this up, since partition.nodes is a hash table and contains the same elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-10 17:45:29 +00:00
c4b84a46a9 Add more logging to TunableOp validators (#135396)
Summary: Add more logging to TunableOp validators

Test Plan:
Verified additional logging when loading kernel selections:
```
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
```

```
[qizixi@devgpu039.atn3 /data/users/qizixi/fbsource/fbcode (f9305317d|remote/master)]$ PYTORCH_TUNABLEOP_VERBOSE=1 buck2 run mode/{opt,amd-gpu} -c fbcode.e
nable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning
File changed: fbcode//hipblas_tuning_pt_llama0.csv
Buck UI: https://www.internalfb.com/buck2/1ed2fac4-743e-49ef-805f-7fb6b9300022
Network: Up: 0B  Down: 0B
Jobs completed: 4189. Time elapsed: 0.2s.
BUILD SUCCEEDED
Enabled tuning
- Run Linear (matmul) 2 x 1280 x 8192, dtype = torch.bfloat16
INFO:2024-09-06 14:38:07 2834864:2835138 CuptiActivityProfiler.cpp:260] HIP versions. Roctracer: 4.1; Runtime: 60032830; Driver: 60032830
INFO:2024-09-06 14:38:07 2834864:2836083 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator HIPBLASLT_VERSION=800-a15e4178
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
Avg time: 13.165860176086426 us, Achieved 3.19 TFLOPS, 1598.24 GB/s

- Run Linear (matmul) 2 x 8192 x 1024, dtype = torch.bfloat16
Avg time: 13.230760097503662 us, Achieved 2.54 TFLOPS, 1271.14 GB/s

- Run Linear (matmul) 2 x 7168 x 8192, dtype = torch.bfloat16
Avg time: 26.804399490356445 us, Achieved 8.76 TFLOPS, 4384.90 GB/s

- Run Linear (matmul) 2 x 8192 x 3584, dtype = torch.bfloat16
Avg time: 13.407809734344482 us, Achieved 8.76 TFLOPS, 4384.14 GB/s

2x1280x8192-torch.bfloat16,13.165860176086426,3.18574247630113,1598.237845349412
2x8192x1024-torch.bfloat16,13.230760097503662,2.536092541374924,1271.1420867780075
2x7168x8192-torch.bfloat16,26.804399490356445,8.762778814892096,4384.9040543618985
2x8192x3584-torch.bfloat16,13.407809734344482,8.759112362638383,4384.138585247748
```

Reviewed By: leitian

Differential Revision: D62322830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135396
Approved by: https://github.com/eqy
2024-09-10 17:20:59 +00:00
cyy
bc1b8f094d Check function declarations of Core ML code (#135467)
Relax the restrictions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467
Approved by: https://github.com/ezyang
2024-09-10 16:05:22 +00:00
f65a564fa2 [inductor] Flip custom_op_default_layout_constraint (#135239)
By default, Inductor should respect the stride order of input Tensors to
custom operators.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135239
Approved by: https://github.com/albanD
ghstack dependencies: #135391
2024-09-10 14:27:43 +00:00
386b313028 Handle KeyError for compiler collective in scalars too (#135385)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135385
Approved by: https://github.com/jansel
2024-09-10 12:33:04 +00:00
6d7cbc20d2 Add dynamo itertools.pairwise support (#135416)
Fixes #133766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135416
Approved by: https://github.com/XuehaiPan, https://github.com/jansel

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2024-09-10 11:37:59 +00:00
ca16956b20 [Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #134693
2024-09-10 10:11:52 +00:00
67735d1ee8 [Inductor] Generalize is_cuda to specific device_type to make cpp_wrapper mode be extensible (#134693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel
2024-09-10 10:11:13 +00:00
6e13f5eb38 [FlexAttention] Add broadcast support for kv batch dimension (#135505)
This PR adds broadcast support for KV batch dimension.

## Details
Consider Q of shape `[Bq, Hq, Q_LEN, D]`, and K, V of shape `[Bkv, Hkv, KV_LEN, D]`. Prior to this diff, we require `Bq == Bkv`. However, for some use cases, we may have Bkv < Bq. For example, in paged attention, we provide K, V of shape `[1, Hkv, MAX_LEN, D]`, while still providing Q of shape `[Bq, Hq, Q_LEN, D]`. Here, MAX_LEN is the maximal number of tokens supported by paged attention.

This PR relaxes this requirement to `Bq == Bkv or (Bq > 1 and Bkv == 1)`. This support covers flex decoding as well as the flex attention forward and backward.
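A small sketch of the newly supported shape combination (using the public `flex_attention` entry point; the paged-attention wiring is omitted):

```
import torch
from torch.nn.attention.flex_attention import flex_attention

Bq, Hq, Q_LEN, KV_LEN, D = 4, 16, 128, 256, 64

q = torch.randn(Bq, Hq, Q_LEN, D, device="cuda", dtype=torch.bfloat16)
# Broadcast case from this PR: a single shared KV batch (Bkv == 1) for Bq > 1.
k = torch.randn(1, Hq, KV_LEN, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, Hq, KV_LEN, D, device="cuda", dtype=torch.bfloat16)

out = torch.compile(flex_attention)(q, k, v)
print(out.shape)   # torch.Size([4, 16, 128, 64])
```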

## Benchmark
GPU: H100

We see negligible (1%~2%) performance change from this PR when `Bq == Bkv`.

```
python benchmarks/transformer/score_mod.py --calculate-bwd
```
### Perf before this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)        |
|---------|-----------|---------------|------------|----------------|------------------------------|
| Average |     0.743 |               |            |                |                              |
| Max     |     0.955 | head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)   |
| Min     |     0.548 | relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.834 |             |            |                |                             |
| Max     |     1.261 | head_bias   | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)   |
| Min     |     0.456 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          107.040 |             140.800 |         0.888 |         0.760 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.840 |              19.744 |          112.576 |             140.064 |         0.802 |         0.804 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.232 |              17.344 |           87.744 |             142.496 |         0.878 |         0.616 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          108.192 |             143.328 |         0.888 |         0.755 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.904 |              22.400 |          106.432 |             136.512 |         0.889 |         0.780 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.424 |              26.752 |           91.712 |             106.688 |         0.726 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.808 |              22.432 |           89.024 |             101.920 |         0.883 |         0.873 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.840 |              22.272 |           88.896 |             102.592 |         0.891 |         0.867 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.240 |              32.416 |          116.768 |             112.256 |         0.933 |         1.040 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           29.536 |              37.024 |          113.664 |             102.688 |         0.798 |         1.107 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.656 |              32.800 |          116.992 |             127.008 |         0.935 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.592 |              32.480 |          116.928 |             112.160 |         0.942 |         1.043 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.920 |          198.656 |             204.512 |         0.653 |         0.971 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           37.760 |              62.528 |          189.536 |             170.624 |         0.604 |         1.111 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.896 |              62.368 |          198.304 |             205.824 |         0.656 |         0.963 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.952 |          198.432 |             203.648 |         0.653 |         0.974 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          318.528 |             355.904 |          947.232 |            1162.496 |         0.895 |         0.815 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          199.776 |             252.128 |          677.792 |             813.184 |         0.792 |         0.834 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          316.512 |             363.328 |          947.712 |            1361.984 |         0.871 |         0.696 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          317.984 |             356.864 |          947.264 |            1165.024 |         0.891 |         0.813 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          446.656 |             734.656 |         1664.288 |            2172.960 |         0.608 |         0.766 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          278.688 |             467.648 |         1182.624 |            1339.296 |         0.596 |         0.883 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          447.872 |             744.096 |         1662.944 |            2196.544 |         0.602 |         0.757 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          448.128 |             732.928 |         1663.072 |            2156.800 |         0.611 |         0.771 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.648 |              16.640 |          107.520 |             143.008 |         0.940 |         0.752 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.776 |              18.240 |          129.056 |             141.920 |         0.865 |         0.909 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.168 |              16.640 |          103.616 |             139.648 |         0.912 |         0.742 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.616 |              16.640 |          128.608 |             164.448 |         0.938 |         0.782 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              21.952 |          125.344 |             170.304 |         0.901 |         0.736 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              23.712 |          104.288 |             196.896 |         0.834 |         0.530 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.072 |              21.952 |          102.080 |             177.056 |         0.869 |         0.577 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.648 |              21.920 |          109.920 |             170.848 |         0.896 |         0.643 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.936 |          127.808 |             228.832 |         0.954 |         0.559 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           29.472 |              33.856 |          113.152 |             215.072 |         0.871 |         0.526 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.496 |              32.160 |          116.576 |             231.744 |         0.948 |         0.503 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.904 |          116.320 |             229.824 |         0.955 |         0.506 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.480 |              61.440 |          176.448 |             345.312 |         0.659 |         0.511 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           38.304 |              59.424 |          169.312 |             371.360 |         0.645 |         0.456 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.960 |              61.760 |          176.512 |             358.912 |         0.663 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.352 |              61.696 |          176.512 |             344.928 |         0.654 |         0.512 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.224 |             357.728 |          905.728 |            1668.448 |         0.884 |         0.543 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          199.904 |             248.416 |          636.544 |            1109.088 |         0.805 |         0.574 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          314.880 |             363.616 |          906.304 |            1658.176 |         0.866 |         0.547 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.160 |             354.368 |          906.080 |            1649.024 |         0.892 |         0.549 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.912 |             739.840 |         1555.808 |            2521.952 |         0.604 |         0.617 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          279.776 |             463.904 |         1068.928 |            1849.888 |         0.603 |         0.578 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.080 |             748.960 |         1553.504 |            2629.888 |         0.596 |         0.591 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.208 |             740.608 |         1558.880 |            2524.960 |         0.602 |         0.617 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           33.568 |              41.280 |          170.016 |             147.584 |         0.813 |         1.152 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           30.688 |              43.040 |          159.552 |             146.720 |         0.713 |         1.087 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.112 |              41.504 |          170.112 |             152.672 |         0.822 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.240 |              41.152 |          170.272 |             134.976 |         0.832 |         1.261 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.672 |              76.416 |          295.296 |             263.648 |         0.637 |         1.120 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.088 |              72.576 |          281.920 |             237.664 |         0.621 |         1.186 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.032 |              76.672 |          295.520 |             265.248 |         0.626 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.096 |              76.096 |          295.456 |             262.112 |         0.632 |         1.127 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.920 |             111.232 |          401.568 |             382.944 |         0.844 |         1.049 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           68.192 |              95.232 |          338.752 |             326.816 |         0.716 |         1.037 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.984 |             111.840 |          401.856 |             444.224 |         0.840 |         0.905 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           94.176 |             110.496 |          401.600 |             383.136 |         0.852 |         1.048 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.488 |             227.040 |          727.424 |             739.712 |         0.579 |         0.983 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           95.616 |             169.760 |          616.864 |             574.112 |         0.563 |         1.074 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.680 |             228.672 |          727.616 |             746.048 |         0.576 |         0.975 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.104 |             225.696 |          727.904 |             735.392 |         0.581 |         0.990 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1227.296 |            1386.656 |         3720.192 |            4539.904 |         0.885 |         0.819 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          691.360 |             831.712 |         2515.872 |            3067.808 |         0.831 |         0.820 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1228.192 |            1403.136 |         3715.520 |            5309.280 |         0.875 |         0.700 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1229.024 |            1384.992 |         3715.904 |            4550.368 |         0.887 |         0.817 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1784.832 |            2865.888 |         6539.840 |            8460.224 |         0.623 |         0.773 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1017.408 |            1660.480 |         4369.824 |            5056.992 |         0.613 |         0.864 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1792.448 |            2904.864 |         6546.080 |            8537.024 |         0.617 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1795.552 |            2856.864 |         6544.672 |            8400.160 |         0.629 |         0.779 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.880 |          148.832 |             179.936 |         0.881 |         0.827 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.168 |              38.080 |          138.528 |             167.552 |         0.818 |         0.827 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              39.168 |          148.512 |             181.248 |         0.874 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.784 |          148.864 |             180.224 |         0.883 |         0.826 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.832 |              76.352 |          253.632 |             295.968 |         0.640 |         0.857 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           45.760 |              65.792 |          239.040 |             290.752 |         0.696 |         0.822 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.576 |          253.312 |             304.032 |         0.637 |         0.833 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.192 |          253.600 |             296.096 |         0.640 |         0.856 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.728 |             109.728 |          357.696 |             498.912 |         0.854 |         0.717 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           68.704 |              92.288 |          295.616 |             386.240 |         0.744 |         0.765 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.632 |             111.392 |          357.408 |             512.448 |         0.841 |         0.697 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.280 |             109.952 |          357.696 |             501.440 |         0.848 |         0.713 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.392 |             230.496 |          612.224 |             807.552 |         0.570 |         0.758 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |           96.512 |             165.184 |          502.624 |             672.384 |         0.584 |         0.748 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.360 |             232.608 |          612.064 |             832.320 |         0.565 |         0.735 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.008 |             230.528 |          612.640 |             804.320 |         0.568 |         0.762 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1227.968 |            1377.408 |         3477.920 |            5324.384 |         0.892 |         0.653 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          695.264 |             824.544 |         2268.224 |            3210.208 |         0.843 |         0.707 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.640 |            1404.576 |         3476.832 |            5463.456 |         0.875 |         0.636 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.416 |            1378.752 |         3478.048 |            5367.712 |         0.891 |         0.648 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1788.736 |            2867.712 |         6039.520 |            8616.256 |         0.624 |         0.701 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1021.952 |            1653.824 |         3866.208 |            5306.848 |         0.618 |         0.729 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.752 |            2896.352 |         6044.128 |            8871.360 |         0.617 |         0.681 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.080 |            2868.672 |         6040.160 |            8550.144 |         0.623 |         0.706 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.504 |              71.552 |          312.768 |             255.040 |         0.804 |         1.226 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           49.472 |              71.104 |          285.696 |             243.520 |         0.696 |         1.173 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           58.112 |              72.896 |          312.768 |             288.256 |         0.797 |         1.085 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.952 |              71.680 |          312.768 |             255.552 |         0.808 |         1.224 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.336 |             144.256 |          580.128 |             500.160 |         0.571 |         1.160 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.160 |             123.712 |          552.544 |             447.648 |         0.616 |         1.234 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.400 |             145.184 |          580.032 |             504.032 |         0.568 |         1.151 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.368 |             143.904 |          580.192 |             499.936 |         0.572 |         1.161 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.216 |             209.568 |          787.872 |             747.712 |         0.846 |         1.054 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          121.984 |             168.256 |          651.968 |             628.256 |         0.725 |         1.038 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.088 |             211.488 |          788.320 |             864.352 |         0.837 |         0.912 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.440 |             208.576 |          787.424 |             749.120 |         0.851 |         1.051 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.472 |             441.376 |         1405.440 |            1431.648 |         0.565 |         0.982 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          172.960 |             312.064 |         1172.064 |            1096.448 |         0.554 |         1.069 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.632 |             446.336 |         1405.408 |            1448.480 |         0.559 |         0.970 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          250.944 |             440.128 |         1406.624 |            1421.952 |         0.570 |         0.989 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2418.720 |            2747.936 |         7330.432 |            9023.712 |         0.880 |         0.812 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1353.696 |            1608.480 |         4941.696 |            6078.752 |         0.842 |         0.813 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2427.456 |            2746.816 |         7329.792 |           10539.968 |         0.884 |         0.695 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2426.688 |            2763.168 |         7336.256 |            9057.536 |         0.878 |         0.810 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3554.240 |            5634.400 |        12919.872 |           16843.489 |         0.631 |         0.767 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2003.648 |            3250.784 |         8610.144 |           10015.424 |         0.616 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3582.080 |            5710.944 |        12923.328 |           17011.871 |         0.627 |         0.760 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3581.920 |            5618.144 |        12934.528 |           16745.888 |         0.638 |         0.772 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.120 |              71.232 |          269.760 |             295.680 |         0.802 |         0.912 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           49.408 |              65.312 |          242.304 |             253.952 |         0.756 |         0.954 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.504 |              72.544 |          269.632 |             298.976 |         0.793 |         0.902 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.760 |              71.040 |          269.600 |             296.640 |         0.813 |         0.909 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           82.336 |             147.168 |          466.080 |             487.456 |         0.559 |         0.956 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.040 |          435.392 |             453.248 |         0.667 |         0.961 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.856 |             147.424 |          465.920 |             499.552 |         0.555 |         0.933 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.760 |             146.656 |          466.176 |             485.984 |         0.557 |         0.959 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             206.976 |          678.080 |             866.976 |         0.853 |         0.782 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          121.664 |             164.768 |          538.240 |             636.160 |         0.738 |         0.846 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             209.664 |          677.696 |             883.424 |         0.842 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          177.440 |             207.840 |          677.248 |             868.288 |         0.854 |         0.780 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.272 |             449.536 |         1163.424 |            1420.832 |         0.557 |         0.819 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          173.472 |             305.376 |          929.408 |            1104.544 |         0.568 |         0.841 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          249.376 |             454.976 |         1163.648 |            1455.296 |         0.548 |         0.800 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.368 |             450.144 |         1163.520 |            1409.984 |         0.556 |         0.825 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2416.576 |            2726.208 |         6835.520 |           10442.784 |         0.886 |         0.655 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1357.440 |            1590.752 |         4433.664 |            5975.296 |         0.853 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2427.360 |            2747.040 |         6853.056 |           10670.784 |         0.884 |         0.642 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2441.120 |            2718.944 |         6836.640 |           10433.792 |         0.898 |         0.655 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3555.392 |            5620.960 |        11944.000 |           16504.801 |         0.633 |         0.724 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2010.848 |            3241.152 |         7636.064 |            9870.464 |         0.620 |         0.774 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3557.440 |            5688.352 |        11935.744 |           17090.496 |         0.625 |         0.698 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3562.720 |            5630.432 |        11939.168 |           16392.033 |         0.633 |         0.728 |

</details>

### Perf after this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)      |
|---------|-----------|---------------|------------|----------------|----------------------------|
| Average |     0.776 |               |            |                |                            |
| Max     |     1.006 | None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64) |
| Min     |     0.566 | relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.817 |             |            |                |                             |
| Max     |     1.150 | None        | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128) |
| Min     |     0.454 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.680 |              17.056 |           64.544 |              73.376 |         0.919 |         0.880 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.712 |              19.872 |           65.408 |              72.864 |         0.791 |         0.898 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.160 |              17.280 |           64.896 |              73.888 |         0.935 |         0.878 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.192 |              17.120 |           64.896 |              75.424 |         0.946 |         0.860 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.648 |              22.496 |           89.184 |              82.592 |         0.873 |         1.080 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.320 |              26.816 |           91.264 |              82.880 |         0.758 |         1.101 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.096 |              22.528 |           89.184 |              83.776 |         0.892 |         1.065 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.680 |              22.432 |           89.184 |             120.096 |         0.877 |         0.743 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.384 |              32.512 |          119.232 |             128.960 |         0.996 |         0.925 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.176 |              37.248 |          113.664 |             119.520 |         0.810 |         0.951 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.512 |              32.928 |          119.264 |             131.456 |         0.987 |         0.907 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.448 |              32.704 |          119.200 |             128.352 |         0.992 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.952 |              62.176 |          199.040 |             214.304 |         0.675 |         0.929 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           39.744 |              62.880 |          189.504 |             179.968 |         0.632 |         1.053 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.472 |              62.784 |          199.136 |             217.664 |         0.661 |         0.915 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           42.048 |              61.952 |          199.168 |             214.496 |         0.679 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          341.184 |             357.632 |          980.256 |            1328.896 |         0.954 |         0.738 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          212.576 |             252.960 |          673.888 |             824.864 |         0.840 |         0.817 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.000 |             363.296 |          980.768 |            1375.808 |         0.936 |         0.713 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.768 |             356.832 |          980.960 |            1326.272 |         0.955 |         0.740 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          459.392 |             737.120 |         1678.240 |            2205.248 |         0.623 |         0.761 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          292.672 |             468.096 |         1178.016 |            1371.584 |         0.625 |         0.859 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.144 |             745.312 |         1680.000 |            2252.512 |         0.620 |         0.746 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.112 |             736.576 |         1679.008 |            2216.480 |         0.627 |         0.758 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.064 |              16.704 |          105.120 |             120.768 |         0.962 |         0.870 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.552 |              18.144 |          107.136 |             121.696 |         0.857 |         0.880 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.096 |              16.768 |          102.688 |             120.864 |         0.960 |         0.850 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.032 |              16.576 |          104.736 |             124.672 |         0.967 |         0.840 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.392 |              21.952 |          104.736 |             174.656 |         0.883 |         0.600 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           20.128 |              23.712 |          105.216 |             199.008 |         0.849 |         0.529 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.904 |              21.888 |          103.744 |             179.520 |         0.909 |         0.578 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.968 |              21.952 |          104.640 |             177.312 |         0.910 |         0.590 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.096 |              31.904 |          118.720 |             231.968 |         1.006 |         0.512 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.528 |              33.952 |          112.480 |             218.304 |         0.899 |         0.515 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.160 |              32.224 |          118.752 |             237.312 |         0.998 |         0.500 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.128 |              32.032 |          118.240 |             233.120 |         1.003 |         0.507 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.280 |          177.408 |             350.688 |         0.674 |         0.506 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           39.552 |              59.360 |          168.832 |             371.488 |         0.666 |         0.454 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.984 |              61.696 |          177.376 |             360.416 |         0.680 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.760 |          177.184 |             355.744 |         0.669 |         0.498 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.744 |             357.888 |          939.712 |            1665.376 |         0.949 |         0.564 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          212.608 |             248.832 |          633.280 |            1122.848 |         0.854 |         0.564 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.712 |             363.232 |          940.448 |            1689.440 |         0.935 |         0.557 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          341.056 |             355.264 |          940.128 |            1641.152 |         0.960 |         0.573 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.736 |             741.024 |         1569.824 |            2559.552 |         0.622 |         0.613 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          293.856 |             464.192 |         1066.240 |            1840.416 |         0.633 |         0.579 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.704 |             753.152 |         1570.112 |            2641.088 |         0.612 |         0.594 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.832 |             745.536 |         1570.144 |            2602.560 |         0.618 |         0.603 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.680 |              41.280 |          171.840 |             158.176 |         0.864 |         1.086 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           31.360 |              42.976 |          158.912 |             139.264 |         0.730 |         1.141 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.168 |              41.600 |          171.648 |             161.344 |         0.845 |         1.064 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.136 |              41.152 |          171.808 |             158.336 |         0.854 |         1.085 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.832 |              76.384 |          295.680 |             277.696 |         0.639 |         1.065 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.632 |              72.512 |          281.760 |             250.752 |         0.629 |         1.124 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           49.504 |              76.608 |          295.584 |             279.712 |         0.646 |         1.057 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.864 |              75.904 |          295.456 |             277.568 |         0.644 |         1.064 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.392 |             111.232 |          408.640 |             442.656 |         0.894 |         0.923 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           71.392 |              95.168 |          338.784 |             341.760 |         0.750 |         0.991 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.808 |             112.256 |          408.608 |             456.160 |         0.889 |         0.896 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |          100.032 |             110.816 |          408.512 |             444.192 |         0.903 |         0.920 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.040 |             226.112 |          726.880 |             774.176 |         0.597 |         0.939 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           99.904 |             169.696 |          616.448 |             607.104 |         0.589 |         1.015 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.488 |             228.384 |          727.776 |             782.368 |         0.593 |         0.930 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.744 |             225.664 |          728.000 |             773.600 |         0.602 |         0.941 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1324.192 |            1387.808 |         3866.944 |            5217.184 |         0.954 |         0.741 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          738.464 |             832.608 |         2507.392 |            3146.688 |         0.887 |         0.797 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.016 |            1404.256 |         3867.872 |            5382.624 |         0.944 |         0.719 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.144 |            1386.688 |         3867.552 |            5203.264 |         0.956 |         0.743 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1847.488 |            2866.336 |         6612.704 |            8597.696 |         0.645 |         0.769 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1066.592 |            1660.640 |         4357.696 |            5174.016 |         0.642 |         0.842 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1850.464 |            2905.408 |         6616.928 |            8793.280 |         0.637 |         0.752 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1848.896 |            2834.720 |         6623.872 |            8637.920 |         0.652 |         0.767 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.384 |              38.656 |          150.336 |             182.624 |         0.941 |         0.823 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.360 |              38.112 |          137.664 |             171.840 |         0.823 |         0.801 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.608 |              39.040 |          150.528 |             183.872 |         0.938 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.064 |              38.656 |          150.560 |             183.520 |         0.933 |         0.820 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.344 |              76.352 |          253.920 |             301.440 |         0.646 |         0.842 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           46.720 |              65.824 |          239.424 |             296.384 |         0.710 |         0.808 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.248 |              76.416 |          253.728 |             307.808 |         0.644 |         0.824 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.376 |              76.288 |          253.728 |             304.736 |         0.647 |         0.833 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.144 |          364.960 |             503.072 |         0.901 |         0.725 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           71.136 |              92.384 |          294.432 |             393.056 |         0.770 |         0.749 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.200 |             111.360 |          365.152 |             512.640 |         0.891 |         0.712 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.240 |          365.088 |             504.224 |         0.900 |         0.724 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.680 |             230.336 |          613.472 |             816.896 |         0.589 |         0.751 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          100.256 |             165.088 |          502.144 |             676.480 |         0.607 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.008 |             232.480 |          613.184 |             836.672 |         0.581 |         0.733 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.232 |             230.624 |          613.536 |             827.136 |         0.586 |         0.742 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1324.064 |            1378.688 |         3631.808 |            5308.384 |         0.960 |         0.684 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          731.776 |             826.688 |         2263.168 |            3241.344 |         0.885 |         0.698 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1316.128 |            1403.200 |         3625.088 |            5550.688 |         0.938 |         0.653 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1311.904 |            1378.880 |         3616.320 |            5353.696 |         0.951 |         0.675 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1837.856 |            2887.392 |         6121.632 |            8586.656 |         0.637 |         0.713 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1066.976 |            1654.368 |         3843.136 |            5291.040 |         0.645 |         0.726 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1854.208 |            2896.832 |         6130.112 |            8745.984 |         0.640 |         0.701 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1860.512 |            2889.344 |         6135.648 |            8750.592 |         0.644 |         0.701 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.640 |              71.552 |          315.968 |             296.512 |         0.847 |         1.066 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           50.784 |              71.040 |          284.288 |             258.880 |         0.715 |         1.098 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           61.312 |              72.704 |          315.680 |             302.016 |         0.843 |         1.045 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.800 |              71.776 |          316.320 |             297.152 |         0.847 |         1.065 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.576 |             144.416 |          580.576 |             535.936 |         0.586 |         1.083 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.064 |             123.648 |          553.344 |             481.376 |         0.615 |         1.150 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.160 |             145.248 |          581.024 |             540.000 |         0.579 |         1.076 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.512 |             143.552 |          581.088 |             535.776 |         0.589 |         1.085 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.152 |             209.408 |          798.400 |             868.704 |         0.903 |         0.919 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          127.552 |             168.800 |          650.816 |             663.328 |         0.756 |         0.981 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.376 |             211.360 |          798.080 |             895.552 |         0.896 |         0.891 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.440 |             208.576 |          797.888 |             873.152 |         0.908 |         0.914 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          257.536 |             441.760 |         1408.960 |            1514.720 |         0.583 |         0.930 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          179.328 |             312.096 |         1170.368 |            1177.472 |         0.575 |         0.994 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          259.264 |             446.944 |         1408.768 |            1530.400 |         0.580 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          258.080 |             440.480 |         1408.864 |            1514.144 |         0.586 |         0.930 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.808 |            2771.456 |         7616.704 |           10405.248 |         0.937 |         0.732 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1435.744 |            1610.336 |         4927.520 |            6220.000 |         0.892 |         0.792 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.264 |            2745.056 |         7611.232 |           10631.392 |         0.945 |         0.716 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2576.256 |            2735.456 |         7626.400 |           10346.976 |         0.942 |         0.737 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.744 |            5634.816 |        13077.056 |           17182.528 |         0.653 |         0.761 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2099.360 |            3250.176 |         8589.664 |           10236.672 |         0.646 |         0.839 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3676.800 |            5716.288 |        13073.088 |           17311.071 |         0.643 |         0.755 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.136 |            5570.496 |        13070.720 |           17192.863 |         0.660 |         0.760 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.600 |              71.008 |          272.320 |             300.000 |         0.868 |         0.908 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           50.176 |              65.344 |          241.568 |             258.912 |         0.768 |         0.933 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.120 |              72.512 |          272.672 |             305.408 |         0.843 |         0.893 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.248 |              71.136 |          272.640 |             301.120 |         0.861 |         0.905 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.872 |             146.784 |          466.912 |             496.832 |         0.571 |         0.940 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.072 |          435.584 |             462.112 |         0.667 |         0.943 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.392 |             147.392 |          466.656 |             504.448 |         0.566 |         0.925 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.360 |             146.688 |          466.656 |             499.040 |         0.568 |         0.935 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.024 |             207.584 |          684.768 |             873.568 |         0.911 |         0.784 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          126.944 |             164.288 |          536.192 |             645.984 |         0.773 |         0.830 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          188.768 |             209.760 |          684.096 |             897.504 |         0.900 |         0.762 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.408 |             207.776 |          685.024 |             876.384 |         0.912 |         0.782 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          259.168 |             449.536 |         1167.936 |            1433.280 |         0.577 |         0.815 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          180.000 |             305.312 |          928.000 |            1113.920 |         0.590 |         0.833 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          258.464 |             455.136 |         1167.808 |            1462.848 |         0.568 |         0.798 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          257.824 |             450.208 |         1167.744 |            1448.000 |         0.573 |         0.806 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2598.368 |            2729.120 |         7134.400 |           10381.632 |         0.952 |         0.687 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1435.456 |            1591.040 |         4424.768 |            6035.808 |         0.902 |         0.733 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2594.752 |            2725.952 |         7128.384 |           10822.496 |         0.952 |         0.659 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2597.888 |            2716.960 |         7101.568 |           10385.440 |         0.956 |         0.684 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3647.648 |            5581.632 |        12089.952 |           16667.233 |         0.654 |         0.725 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2093.952 |            3241.440 |         7579.392 |            9847.936 |         0.646 |         0.770 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3650.528 |            5650.688 |        12105.568 |           16963.680 |         0.646 |         0.714 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3680.064 |            5585.312 |        12117.504 |           16935.040 |         0.659 |         0.716 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135505
Approved by: https://github.com/Chillee
2024-09-10 09:30:02 +00:00
23b1486185 [MPS] Allow nan mean reduction in nll_loss (#135434)
This PR allows results from `nll_loss` to be `nan`, matching the behavior on CUDA and CPU (https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162).
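
For reference, a small CPU sketch of the case in question (assuming the default `ignore_index` convention); this change makes MPS return the same `nan`:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                          # (N, C) log-probabilities
targets = torch.full((4,), -100, dtype=torch.long)   # every target equals ignore_index

# With all targets ignored, the mean reduction divides by a zero total weight,
# so CPU/CUDA return nan; MPS now follows the same convention.
loss = F.nll_loss(logits, targets, ignore_index=-100, reduction="mean")
print(loss)  # tensor(nan)
```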

Fixes #134431

Ref #64572 #119108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434
Approved by: https://github.com/malfet
2024-09-10 08:37:59 +00:00
9902b349cb [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for membership lookups. With large models it is a large list, and these lookups can take over a millisecond in some cases.
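
A minimal sketch of the idea behind the change (the sizes here are made up, and this is not the actual Inductor code):

```python
# Membership tests against a large list are O(n) per lookup;
# converting to a set makes each lookup O(1).
static_input_idxs = list(range(50_000))

static_input_idxs_set = set(static_input_idxs)
is_static = 49_999 in static_input_idxs_set  # hash lookup instead of a linear scan
```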

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: the gaps are smaller, giving a ~1ms speedup before the CUDA graph launch
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-10 07:27:55 +00:00
5a9ac83e94 Fix doc (#135551)
Differential Revision: [D62412667](https://our.internmc.facebook.com/intern/diff/D62412667/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135551
Approved by: https://github.com/yushangdi
ghstack dependencies: #135549
2024-09-10 07:18:44 +00:00
1adf28a5c0 [inductor] print triton float64 constants correctly (#135260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260
Approved by: https://github.com/jansel
2024-09-10 07:05:02 +00:00
c18052da0e Add some minor doc improvement and ban using training IR for unflattener (#135549)
Title

Differential Revision: [D62412490](https://our.internmc.facebook.com/intern/diff/D62412490/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135549
Approved by: https://github.com/yushangdi
2024-09-10 06:48:42 +00:00
c0d2f991b1 Increase TRITON_MAX_BLOCK['X'] (#135181)
Fixes #135028

As the title says, this increases `TRITON_MAX_BLOCK['X']` to 4096 and fixes an error; thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189
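
A purely illustrative sketch of what such a per-dimension cap implies for block-size selection (the constant name is reused from the prose; the values and the clamp helper are hypothetical, not Inductor's actual code):

```python
# Hypothetical illustration: a per-dimension cap bounds the block sizes
# the autotuner is allowed to pick.
TRITON_MAX_BLOCK = {"X": 4096, "Y": 1024, "Z": 1024}

def clamp_block(dim: str, requested: int) -> int:
    # Keep the requested block size within the configured per-dimension cap.
    return min(requested, TRITON_MAX_BLOCK[dim])

print(clamp_block("X", 8192))  # -> 4096
```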

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181
Approved by: https://github.com/jansel
2024-09-10 05:54:37 +00:00
e889252493 Implementation of scan (#134102)
This operation is intended as the counterpart to `associative_scan`, but it can also operate with non-associative functions.
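
A plain-Python sketch of the scan semantics this adds (a reference illustration, not the actual higher-order-op API):

```python
import torch

def reference_scan(combine_fn, init, xs):
    # Carry the state through each step and record every intermediate result.
    # Unlike associative_scan, combine_fn need not be associative, so
    # evaluation is inherently sequential.
    carry, ys = init, []
    for x in xs.unbind(0):
        carry = combine_fn(carry, x)
        ys.append(carry)
    return carry, torch.stack(ys)

# Example with a non-associative combine: an exponential moving average.
xs = torch.arange(5, dtype=torch.float32)
final, history = reference_scan(lambda c, x: 0.9 * c + 0.1 * x, torch.tensor(0.0), xs)
```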

@ydwu4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134102
Approved by: https://github.com/ydwu4
2024-09-10 04:51:16 +00:00
6546c6186d do not raise when flatten_fn_with_keys not found when suggesting fixes (#135518)
Test Plan: added test

Differential Revision: D62395371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135518
Approved by: https://github.com/zhxchen17
2024-09-10 03:47:36 +00:00
1d9fefff19 [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)
Some optimizers don't have any state, which can cause `get_state_dict`/`set_state_dict` to behave incorrectly. This PR fixes those issues.
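
A small sketch of the stateless-optimizer case (plain SGD, which typically keeps no per-parameter state):

```python
import torch

param = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.SGD([param], lr=0.1)  # no momentum, so no per-parameter state

param.sum().backward()
opt.step()
print(opt.state_dict()["state"])  # typically {} — nothing for the distributed state_dict to handle
```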

fixes: https://github.com/pytorch/pytorch/issues/133415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535
Approved by: https://github.com/wz337
2024-09-10 03:10:00 +00:00
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380. Adjusts the torchbench and huggingface skip-model lists so that `--no-skip` can be removed when running benchmarks on the 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
146921007a [inductor] [cpp] fix the input contiguous check in max-autotune (#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.

In this PR, we check whether the input is contiguous as follows:
If it has a `FixedLayout`, we know the exact strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can use the buffer's fill order to decide whether it's contiguous. In all other cases, we don't use the GEMM template, since we can't infer whether the input is contiguous.
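
A toy sketch of the decision described above (the classes and fields here only mirror the prose; they are not the real Inductor IR types):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FixedLayout:
    stride: Tuple[int, ...]

@dataclass
class FlexibleLayout:
    # Known only when the data is a ComputedBuffer; None otherwise.
    contiguous_fill_order: Optional[bool] = None

def can_use_gemm_template(layout) -> bool:
    if isinstance(layout, FixedLayout):
        # Strides are exact: the template requires a unit innermost stride.
        return layout.stride[-1] == 1
    if isinstance(layout, FlexibleLayout) and layout.contiguous_fill_order is not None:
        # Use the buffer's fill order to predict the realized layout.
        return layout.contiguous_fill_order
    # Cannot infer the final layout, so do not add the GEMM template choice.
    return False
```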

## Additional context
The current GEMM template only supports the case `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input, which is a `View` IR, the `convert_to_reinterpret_view` call fails:
d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)

And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
d14fe3ffed/torch/_inductor/ir.py (L4722)

When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1`, but it is later fixed to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing the accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)) which calls [slice_nd](d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](d14fe3ffed/torch/_inductor/ir.py (L2288)) invokes
[decide_layout](d14fe3ffed/torch/_inductor/ir.py (L2135)) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-10 02:47:38 +00:00
a71e5509bc [inductor]Add profiler to operatorbench (#135515)
Add profiling to operatorbench. A new `--profile` argument is added; the resulting profiling trace looks like the following figure.
<img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515
Approved by: https://github.com/shunting314
2024-09-10 02:33:30 +00:00
136e28f616 Enable forward AD in functional.affine_grid (#135494)
Fixes #121411
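
A minimal sketch of exercising forward-mode AD through `affine_grid` (shapes and values are illustrative):

```python
import torch
import torch.nn.functional as F
import torch.autograd.forward_ad as fwAD

theta = torch.eye(2, 3).unsqueeze(0)   # (N, 2, 3) affine matrices
tangent = torch.randn_like(theta)      # direction for the JVP

with fwAD.dual_level():
    dual_theta = fwAD.make_dual(theta, tangent)
    grid = F.affine_grid(dual_theta, size=(1, 3, 4, 4), align_corners=False)
    primal, jvp = fwAD.unpack_dual(grid)  # jvp is the forward-mode derivative of the grid
```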
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494
Approved by: https://github.com/zou3519, https://github.com/soulitzer
2024-09-10 00:07:07 +00:00
39a61795e3 remove amax_ptr from scaled_gemm (#135421)
amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135421
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-09-09 23:04:36 +00:00
b4feec9782 [xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529)
Building XNNPACK as a static library has some issues because of multiple global params floating around.

Let's try to get rid of it in xplat and see how it fares.

Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529
Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign
2024-09-09 22:47:01 +00:00
d81731615f [Dynamo] Adding CallFunctionNoArgsSource and (#135425)
CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425
Approved by: https://github.com/anijain2305
2024-09-09 22:46:00 +00:00
e2f9a83b85 [ONNX] Drop final None values as inputs for nodes in exporter graph (#135520)
When the value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them from the graph.
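
A minimal sketch (not the exporter's actual code) of the trimming described above; only trailing `None` values are dropped, interior ones stay:

```python
def strip_trailing_nones(inputs):
    # Drop None values only from the end of the input list.
    end = len(inputs)
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return inputs[:end]

print(strip_trailing_nones(["x", None, "scale", None, None]))  # ['x', None, 'scale']
```
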
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135520
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 22:28:41 +00:00
70a65a8bd5 Revert "NJT <-> padded dense conversions (#125947)"
This reverts commit 09a5e88bef04d5485b70d8f65f46a675aaa52942.

Reverted https://github.com/pytorch/pytorch/pull/125947 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing dynamo test 09a5e88bef, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/125947#issuecomment-2339228570))
2024-09-09 22:01:09 +00:00
689d278543 Revert "Add __init__.py to shape inference folder. (#135461)"
This reverts commit dced0d6d9f05f0962f74a3c6227f774111c15715.

Reverted https://github.com/pytorch/pytorch/pull/135461 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it exposes some public function without appropriate doc. I will reopen the issue with hi-prio so that it can be fixed properly ([comment](https://github.com/pytorch/pytorch/pull/135461#issuecomment-2339218382))
2024-09-09 21:55:13 +00:00
9b764491e3 Use upload-artifact@v4.4.0 for create_release.yml (#135528)
Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007

Due to a broken sync between
```
actions/upload-artifact@v2
and
actions/download-artifact@v4.1.7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135528
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-09 20:48:52 +00:00
cbc6b30a24 Fix broken E2E tests on Linux machines (#135394)
Summary:
I'm not entirely sure why this is failing with an `ImportError` (according to lastnameye, a superclass of `ModuleNotFoundError`), but in our E2E tests on Linux machines (but not Macs?), we're seeing the import failure not getting caught --
`ImportError: cannot import name 'parutil' from 'libfb.py' (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbsource/d0c916ec8d40ce11/arvr/libraries/ctrl/studies/replay/__ctrl-r__/ctrl-r#link-tree/libfb/py/__init__.py)` from this test run https://www.internalfb.com/sandcastle/workflow/2522015791331601269, an instance of this job:  https://www.internalfb.com/intern/test/844425085172858?ref_report_id=0 is the overall job

Test Plan:
`arc skycastle schedule tools/skycastle/workflows2/ctrl/js_tests.sky:test_js_e2e_replay_tests --sandcastle-spec-overrides '{"type": "fbcode", "unicastle_size": "I1_MEDIUM"}'`
->
https://www.internalfb.com/sandcastle/workflow/256705178764255769

Differential Revision: D62321167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135394
Approved by: https://github.com/laithsakka
2024-09-09 20:18:08 +00:00
5b368de7f7 Revert "[ONNX] Update fake mode usage in onnx docs (#135512)"
This reverts commit a13c118994b4f118388d97a35abcb91a396cd437.

Reverted https://github.com/pytorch/pytorch/pull/135512 on behalf of https://github.com/davidberard98 due to failing test  https://github.com/pytorch/pytorch/actions/runs/10778813316/job/29891679127 ([comment](https://github.com/pytorch/pytorch/pull/135512#issuecomment-2338999090))
2024-09-09 20:15:12 +00:00
09a5e88bef NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; the design is punted to a future PR

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-09 19:37:32 +00:00
a4e6a0b240 [split build] move periodic split builds into own concurrency group (#135510)
To avoid nightly workflows cancelling each other
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135510
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-09 19:35:57 +00:00
4ab232d0c4 Fix symbolic number's type and tensor's dtype mismatch bug in Tensor ctor (#135433)
Fixes #135432

In the current implementation, if we try to store a symbolic number in Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type are matched, which is not the case.

In other words, if we try to store a `SymInt`, the current implementation assumes the tensor's dtype is `torch.int32`, `torch.int64`, or similar, and if we try to store a `SymFloat`, it assumes the tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store a `SymInt`, which would be wrong.

This PR stores symbolic numbers according to the tensor's scalar type by wrapping the `SymInt` or `SymFloat`'s guarded number into a PyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433
Approved by: https://github.com/ezyang
2024-09-09 19:32:18 +00:00
2032f107d7 Don't try to tag s390x docker images (#135509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135509
Approved by: https://github.com/atalman
2024-09-09 19:07:48 +00:00
5f7d956362 Fix bugs blocking flipping the default layout constraint for custom ops (#135391)
Fixes two things:
- For regular PyTorch ops, the default layout constraint tag is always
flexible_layout. This was a bug with #135238
- Mark the new quantized _wrapped_linear_prepack ops as flexible_layout.
  The metas for these are incorrect, I didn't want to fix them (and
  changing the default requires the metas actually be correct).

Test Plan:
- The next PR up in the stack. The PRs are split because the next one is
  riskier.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391
Approved by: https://github.com/albanD
2024-09-09 18:24:21 +00:00
a13c118994 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby
2024-09-09 18:10:37 +00:00
21241bfeee [CP] Extend CP to support load-balancing shards (#132442)
This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets shards `rank` and `(world_size * 2 - rank - 1)`. The data re-shuffling is done in the `context_parallel` API.
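
A small sketch (hypothetical helper, not the actual API) of the shard assignment described above:

```python
def shard_ids_for_rank(rank: int, world_size: int) -> tuple[int, int]:
    # 2 * world_size shards total; pair an early shard with a late shard so
    # every rank gets a balanced amount of attention work.
    return rank, 2 * world_size - rank - 1

world_size = 4
for rank in range(world_size):
    print(rank, shard_ids_for_rank(rank, world_size))
# 0 (0, 7)   1 (1, 6)   2 (2, 5)   3 (3, 4)
```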

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442
Approved by: https://github.com/wconstab
2024-09-09 18:04:38 +00:00
73a6fc6e30 Revert "[Inductor] Make static_input_idxs a set for faster lookup (#135314)"
This reverts commit 011cae9570fb3c44b7f6f0c8004c470579ed21da.

Reverted https://github.com/pytorch/pytorch/pull/135314 on behalf of https://github.com/ZainRizvi due to Lint is failing on this file in trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10777258770/job/29885960050) [HUD commit link](011cae9570) ([comment](https://github.com/pytorch/pytorch/pull/135314#issuecomment-2338678219))
2024-09-09 17:33:01 +00:00
09287e3af4 [MPS] Add regression test for fft.fftfreq (#135440)
The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440
Approved by: https://github.com/ezyang
2024-09-09 17:12:36 +00:00
16c3b8f87c [AOTI] Fix assert_function call in cpu autotune template (#135086)
Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086
Approved by: https://github.com/chenyang78, https://github.com/angelayi
ghstack dependencies: #134857
2024-09-09 16:54:12 +00:00
9c6dff4941 [AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857)
Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857
Approved by: https://github.com/angelayi
2024-09-09 16:54:12 +00:00
0eb425a563 [Release] Apply Release changes scripts after release 2.4 (#135495)
Based on additional changes required for https://github.com/pytorch/pytorch/pull/128347
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135495
Approved by: https://github.com/kit1980
2024-09-09 16:49:04 +00:00
011cae9570 [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.
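
A sketch of the change (illustrative sizes only): membership tests on a large list are linear scans, while a set gives constant-time lookups.

```python
static_input_idxs = list(range(50_000))         # large list: `in` is an O(n) scan
static_input_idxs_set = set(static_input_idxs)  # set: `in` is an O(1) hash lookup

idx = 49_999
print(idx in static_input_idxs)      # True, but walks the whole list
print(idx in static_input_idxs_set)  # True, via a single hash probe
```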

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-09 16:24:58 +00:00
dfb2b661f7 Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525)
Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small.
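
A minimal illustration (values chosen only to show the overflow) of why accumulating `var_sum` in Half is unsafe: float16 saturates above 65504, while a float32 accumulator stays finite.

```python
import torch

x = torch.full((4096,), 8.0, dtype=torch.half)
print((x * x).sum(dtype=torch.half))     # inf: 4096 * 64 = 262144 exceeds fp16 range
print((x * x).sum(dtype=torch.float32))  # tensor(262144.)
```
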
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-09 15:31:38 +00:00
5a69e0ebbe [MPS] Update decorator comments with issue ref (#135448)
Updating the comments with references to better places for context now that the bugs have been identified.

xref #135442 #135447 #134184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448
Approved by: https://github.com/ezyang
2024-09-09 15:18:52 +00:00
5e145861f2 [ONNX] Improves documentation of ONNX exporter (#135372)
The PR updates the documentation to reflect the changes introduced in pytorch 2.5 and related to onnx exporter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 15:09:01 +00:00
c35b953531 Fix wrong error msg (#135423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135423
Approved by: https://github.com/ezyang
2024-09-09 13:28:31 +00:00
dced0d6d9f Add __init__.py to shape inference folder. (#135461)
Fixes #135196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135461
Approved by: https://github.com/ezyang
2024-09-09 13:27:58 +00:00
c0436c5701 [inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686) (#135438)
Fix #134686.

PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
```
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
  cpp_packed_gemm_2 2.9371 ms 100.0%
  _linear_pointwise 3.1584 ms 93.0%
```

But it is slower than ATen in the e2e run due to different cache behavior. Access to the input data (12544x3072) is LLC-latency bound, and bottlenecks are seen due to memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by having the processors that share the input data cooperatively load different chunks of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438
Approved by: https://github.com/leslie-fang-intel
2024-09-09 05:16:02 +00:00
cyy
60e8dc4374 Check function declarations in Caffe2 code (#134925)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925
Approved by: https://github.com/ezyang
2024-09-09 05:03:29 +00:00
e6c3f58584 Fix example: Address broadcasting error in the addition of `attn_bias… (#135427)
…` and `attn_mask`, and correct device assignment for newly created variables in the method.

Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method.

1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the input shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired.
2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device.

**This is my retry of PR #130209 . The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits.**

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
@mikaylagawarecki  provided a more elegant implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427
Approved by: https://github.com/ezyang
2024-09-09 03:47:34 +00:00
90e12cf63d Fix return type of nansum example. (#135435)
One of the examples in the documentation of `torch.nansum` contains a wrong return type. This fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135435
Approved by: https://github.com/ezyang
2024-09-09 03:34:52 +00:00
44c08f4984 [Partitioner] Query whether nodes exist in graph faster (#135316)
Finding whether a node exists in `graph.nodes` (a linked list) takes too long. Use `graph._find_nodes_lookup_table` (a hash table) instead to speed up the lookup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135316
Approved by: https://github.com/ezyang
2024-09-09 03:34:02 +00:00
b6186353c6 enable lazy_init for hpu (#135203)
enables lazy_init for hpu device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135203
Approved by: https://github.com/ezyang
2024-09-09 03:32:20 +00:00
b7eb7256fb docs: torch.nn.utils.rnn.pack_padded_sequence: docs improve (#135417)
docs: `torch.nn.utils.rnn.pack_padded_sequence`: docs improve

/cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135417
Approved by: https://github.com/ezyang
2024-09-09 03:16:11 +00:00
c1ae78be92 [inductor] calibration inductor windows uts (18/N) (#135449)
Skip the test_quantized_* UTs of `test/inductor/test_cpu_select_algorithm.py`.
The Windows inductor doesn't support quantization so far.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135449
Approved by: https://github.com/ezyang
2024-09-09 03:10:54 +00:00
defb515306 [NJT]Add permute ops support (#135336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135336
Approved by: https://github.com/davidberard98
2024-09-08 21:00:41 +00:00
31c4e0d37d [inductor] Cleanup analysis done at lowering time (#135412)
Before this we would take multiple passes over the body of each IRNode as we did lowering.  This combines most analysis into `OpCounterCSE` so it can be done in a single pass.

Before:
![image](https://github.com/user-attachments/assets/0047db09-4258-4491-a9a6-b078e183092a)

After:
![image](https://github.com/user-attachments/assets/1e03adcb-8303-4bb1-8bbb-cc42dacd44d7)

This stack:
![image](https://github.com/user-attachments/assets/d6b50b24-c30c-4d23-8b1a-344b3ba65d7a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135412
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377, #135400
2024-09-08 18:02:36 +00:00
53290ca00b [inductor] Refactor BaseSchedulerNode.__init__ (#135400)
Might be a small compile time improvement since we remove a call to extract_read_writes().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135400
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377
2024-09-08 18:02:36 +00:00
16f5155992 [inductor] Fast path for extract_read_writes without tracing (#135377)
Before (bottom of stack):
![image](https://github.com/user-attachments/assets/13060ff9-b31d-42a9-8e8f-c50b2bf3dc2f)

After (this PR):
![image](https://github.com/user-attachments/assets/7d190821-b614-46b7-9e9e-9087443df654)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135377
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306
2024-09-08 18:02:32 +00:00
37144be03d [inductor] Remove ReadWrites.op_counts (#135306)
This was (almost) unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135306
Approved by: https://github.com/oulgen
ghstack dependencies: #135286
2024-09-08 18:02:28 +00:00
3bdc54ed18 [inductor] Refactor LoopBody.memory_usage (#135286)
This is preparing for some other changes where I speed up extract_read_writes tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135286
Approved by: https://github.com/oulgen
2024-09-08 18:02:24 +00:00
cyy
2196f32475 [22/N] Fix clang-tidy warnings in jit (#135319)
Follows #134537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135319
Approved by: https://github.com/titaiwangms
2024-09-08 17:18:29 +00:00
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previously reverted PR could neither be rebased nor imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old-path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still work without changing the public imports, so it's safe to land the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
20cab91a12 [dynamo] Remove skip from jit freeze tests (#135281)
Fixes https://github.com/pytorch/pytorch/issues/119781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135281
Approved by: https://github.com/zou3519
2024-09-08 15:11:12 +00:00
a6fae2e811 Use BRGEMM for Half flash attention forward kernel (#131879)
Use oneDNN BRGEMM on packed data to get better performance on the 5th generation of Xeon where Intel® Advanced Matrix Extensions (AMX) will have fp16 support, e.g. amx-fp16.
Multiple models have achieved acceleration, for instance, FP16 stable diffusion v2.1 has achieved over 50% improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131879
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #131878
2024-09-08 12:32:23 +00:00
042f2f7746 [ONNX] Re-raise the exception if the dynamic shapes cannot be refined (#135418)
Improve error reporting. Otherwise, users will most of the time just see that the shapes could not be refined, rather than the underlying cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135418
Approved by: https://github.com/titaiwangms
2024-09-08 05:30:34 +00:00
fd494dd426 Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix (#135401)
Summary: In https://github.com/pytorch/pytorch/pull/134232, we added two new ops, wrapped_linear_prepack and wrapped_quantized_linear_prepacked. Based on the review comments and offline discussion, we are making them private by adding `_` as a prefix.

Differential Revision: D62325142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
8334cb2fb9 remove commented out breakpoints (#135363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135363
Approved by: https://github.com/oulgen
2024-09-08 02:15:45 +00:00
e72ed4717e [Dynamo] Fix Huggingface PretrainedConfig get non const attr (#135413)
Fixes #135329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135413
Approved by: https://github.com/anijain2305
2024-09-07 19:16:29 +00:00
3bebc09be9 [FlexAttention] Align the matmul tensorcore usage (#135168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135168
Approved by: https://github.com/Chillee
2024-09-07 16:33:41 +00:00
a2db22e6bb [inductor] Catch BrokenProcessPool and print a more helpful message. (#135120)
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like:
```
...
  File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
    async_compile.wait(globals())
  File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
    raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```
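
A minimal sketch (not Inductor's actual code) of catching `BrokenProcessPool` and re-raising with a more actionable message:

```python
from concurrent.futures.process import BrokenProcessPool

def wait_on_futures(futures):
    try:
        return [f.result() for f in futures]
    except BrokenProcessPool as exc:
        # A worker died (likely a crash); point users at single-threaded compile.
        raise RuntimeError(
            "A compilation subprocess exited unexpectedly. This is likely due to a "
            "crash. Re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to compile in the "
            "main process and get a usable traceback."
        ) from exc
```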

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
2024-09-07 16:33:37 +00:00
eac5e12548 [inductor] Move LoopBody to its own file (#135257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257
Approved by: https://github.com/oulgen
2024-09-07 16:29:15 +00:00
18479c5f70 [Doc] update max-autotune for CPU (#134986)
The current doc for `max-autotune` is applicable only for GPU. This PR adds the corresponding content for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134986
Approved by: https://github.com/jgong5, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-07 13:42:40 +00:00
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
b53d97c7be [Intel GPU] Add XPU memory-related APIs (#129919)
# Motivation
According to https://github.com/pytorch/pytorch/issues/116322, we plan to unify the device allocator, so we first introduce a simple XPU device allocator with only the key functionality, and expect to add memory-statistics functionality after the unification.
However, some memory-statistics APIs listed in https://github.com/pytorch/pytorch/issues/127929 have now been requested, and unifying the device allocator will take more time. To improve the user experience, we support these memory-statistics APIs before the unification.

# Additional Context
Fixes: #127929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919
Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #130923
2024-09-07 11:15:17 +00:00
6c1da66407 [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; we could prepare some follow-up PRs to continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-07 11:14:17 +00:00
d7c97e7245 [inductor][cpp][gemm] cache blocking config for dynamic shapes (#133538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133538
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277, #133447

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
be9f4ffe88 [inductor][cpp][gemm] enable dynamic M for k-slicing (#133447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133447
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
692faa9bc6 [inductor][cpp][gemm] reduce memory alloc overhead by allocating local acc once per thread (#135277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135277
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:25 +00:00
32f3af72b7 [ONNX] Support FakeTensor in ONNXProgram (#135399)
Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights.

An error is raised when users try to serialize a FakeTensor to avoid segfaults.

Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399
Approved by: https://github.com/titaiwangms
2024-09-07 04:48:18 +00:00
ebab5c85c4 [FlexAttention] Skip very small block size unit tests on H100 due to Triton bug (#135393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135393
Approved by: https://github.com/BoyuanFeng
2024-09-07 04:35:22 +00:00
3d734d837b [ONNX] Handle mixed sequence inputs properly (#135378)
Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like

```pytb
Traceback (most recent call last):
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op
    converted_named_inputs = _process_python_constants_and_sequences(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences
    raise TypeError(
TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported
```

This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs.

Synced from https://github.com/justinchuby/torch-onnx/pull/187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378
Approved by: https://github.com/titaiwangms
2024-09-07 03:07:39 +00:00
c92227c41a [quant][pt2e] fix placeholder typo and related quantization tests (#135379)
A previous typo in "placeholder" and the related quantization tests are fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135379
Approved by: https://github.com/jerryzh168
2024-09-07 02:31:43 +00:00
e6a0221fc6 [Inductor] Optionally allow padding on non-GPU devices (#135280)
This is the OSS component of a larger MTIA diff.

Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA.

This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs.

Differential Revision: D61038114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280
Approved by: https://github.com/jfix71, https://github.com/shunting314
2024-09-07 02:19:14 +00:00
a6b9d444fb [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to using the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-07 00:50:15 +00:00
d42b0c8f22 Add release matrix for 2.5 (#135383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135383
Approved by: https://github.com/huydhn
2024-09-07 00:49:53 +00:00
941d094dd1 [Dynamo][DTensor] Fixes SymNodeVariable() is not a constant error in Compiled DDP + TP unit test (#135315)
Before the fix, the unit test fails during forward Dynamo tracing:
```
  File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp
    loss = compiled_replicate_model(data).sum()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor
    result = DTensor.from_local(
```
After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474).

I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for.

Fixes https://github.com/pytorch/pytorch/issues/130978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315
Approved by: https://github.com/bdhirsh
2024-09-07 00:11:25 +00:00
b1a934741e Change test_constant_prop_preserve_metadata (#135268)
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```

Reviewed By: angelayi

Differential Revision: D62219974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi
2024-09-07 00:02:35 +00:00
0c661f3e1a [Split Build] Refactor split build binary builds into their own workflows and move split build binary builds to periodic (#134624)
As we need to move split build binary tests from trunk to periodic, this PR refactors those jobs out into their own workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134624
Approved by: https://github.com/malfet
2024-09-06 23:57:56 +00:00
2c7e314803 [Inductor][CPP] Fix the issue of view dtype (#135301)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/135160. It's a regression introduced by https://github.com/pytorch/pytorch/pull/134569, where the dtype of `to_dtype_bitcast` was incorrectly handled when using the scalarize implementation.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_view_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135301
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 23:36:44 +00:00
ead4407f57 [inductor] Fix loop split optimization (#135303)
Fix https://github.com/pytorch/pytorch/issues/135274.

Improve the check of whether the div expr matches: add a check that `split_var` is in `original_body.iter_vars`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135303
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-06 23:06:25 +00:00
2f5b40c099 [aoti test] Disable FP8 funz dtypes in fp8 runtime check test (#135373)
Fixing https://github.com/pytorch/pytorch/issues/126734

The key is that the fnuz FP8 types are AMD-only.

source: https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135373
Approved by: https://github.com/chenyang78
2024-09-06 23:05:47 +00:00
993b5647ab [export] fix placeholder name collision tests by removing map call (#135366)
The current test is failing because of the unstable state of map: torch.compile and non-strict export take two separate routes, unlike cond and while_loop. This PR fixes the test itself; we'll fix map in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366
Approved by: https://github.com/angelayi
2024-09-06 22:02:50 +00:00
2ab26806f1 Require tlparse for failing tests in test_structured_trace.py (#135376)
Summary: These tests are currently failing internally. Per discussion, skip if tlparse is unavailable

Test Plan:
```
feature remove tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
feature install tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
```

Differential Revision: D62310342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135376
Approved by: https://github.com/ezyang
2024-09-06 21:53:41 +00:00
b1612569f6 [BE] Clarify defaulting behavior in optimizer (#135384)
Fixes #135340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384
Approved by: https://github.com/drisspg, https://github.com/jainapurva
2024-09-06 21:52:55 +00:00
dc0e818738 [FR] Automatically infer a common filename prefix (#135158)
Save the annoyance of specifying this on the command line each time
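
A minimal sketch (not the flight recorder's actual code) of inferring a common filename prefix from the files in the trace directory:

```python
import os

def infer_prefix(trace_dir: str) -> str:
    names = os.listdir(trace_dir)
    return os.path.commonprefix(names) if names else ""

# e.g. files "nccl_trace_rank_0", "nccl_trace_rank_1", ... -> "nccl_trace_rank_"
```
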
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #135157
2024-09-06 21:44:27 +00:00
06e414d7fe [FR] Make trace_dir a required argument (#135157)
Ensures users get a clean error if they forget to specify the dir, and
improves the help message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-09-06 21:44:27 +00:00
a681260caf Revert "[ONNX] Refactor exporter errors (#135180)"
This reverts commit 5eebd9315a72422d59b6f8d8ca8e4e573e231d5c.

Reverted https://github.com/pytorch/pytorch/pull/135180 on behalf of https://github.com/clee2000 due to I think this broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10743909338/job/29800779403) [HUD commit link](5eebd9315a), possibly a landrace with the PR that landed before it ([comment](https://github.com/pytorch/pytorch/pull/135180#issuecomment-2334844191))
2024-09-06 21:39:18 +00:00
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
306ac44eaa [ez][TD] Fix request for issue body returns None (#135389)
I assumed it would be an empty string if the body is empty, but it's just None.
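
A tiny sketch of the fix being described (the payload shape is illustrative):

```python
issue_json = {"title": "Some issue", "body": None}  # an empty body comes back as None
body = issue_json.get("body") or ""                 # normalize before string operations
print(len(body))                                    # 0 instead of a TypeError on None
```
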
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135389
Approved by: https://github.com/malfet
2024-09-06 21:02:01 +00:00
a7643baceb Revert expectFailureIf condition on tests with torch.compile on Windows (#134759)
Fixes #134716

This PR reverts some changes introduced in 6eae569546 (#133987)

torch.compile is not available on Windows, so these tests should be expected to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134759
Approved by: https://github.com/malfet
2024-09-06 20:51:55 +00:00
a4030e37be [dynamo] reland map/zip iterator related changes (#135074)
Differential Revision: [D62211019](https://our.internmc.facebook.com/intern/diff/D62211019)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135074
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-09-06 20:38:02 +00:00
22e1fb6faa [test][easy] Add debug utils for cpu select algorithm test (#135038)
Summary: Add debug utils to debug a flaky test in fbcode ci.

Some context: https://github.com/pytorch/pytorch/pull/126545

Test Plan: ci

Differential Revision: D62005445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135038
Approved by: https://github.com/jgong5, https://github.com/XuehaiPan
2024-09-06 20:30:49 +00:00
2a4890e315 [ONNX] Clean up the missed lines from previous PRs (#135368)
Some missed deleted lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135368
Approved by: https://github.com/justinchuby
2024-09-06 20:27:52 +00:00
3ce433aef2 [TCPStore] use wait counters (#135283)
This replaces the existing TCPStore counters with the new shared wait counters. There are no users of the TCPStore counters, so they should be completely safe to remove.

Test plan:

Existing tests + build

There's no OSS backend for wait counters, so we can't write any tests with them currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283
Approved by: https://github.com/c-p-i-o
2024-09-06 19:54:25 +00:00
7f2d20e687 Run all autograd node post hooks (#134728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134728
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-09-06 19:44:28 +00:00
32fd29c1ea [ONNX] Properly handle Attributes in traceable functions (#135367)
Previously, the attributes were sent in as Attr objects even when calling the function as a plain Python function. This PR turns them into Python objects.

From https://github.com/justinchuby/torch-onnx/pull/186
Related https://github.com/microsoft/onnxscript/issues/1846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135367
Approved by: https://github.com/justinchuby
2024-09-06 19:35:22 +00:00
5eebd9315a [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to using the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-06 19:10:56 +00:00
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
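
A hedged usage sketch (assuming the prototype `torch.masked` API) of what the new passthrough enables:

```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0)
mask = torch.tensor([True, True, False, True, True, True])
mt = masked_tensor(data, mask)

windows = mt.unfold(0, 2, 2)  # three windows of size 2; the mask is carried through
print(windows)
```
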
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
b143426db3 [Inductor] Use argument names as the key for the constants dict and the signature dict (#135170)
Referencing how triton constructs these dictionaries

ca3fb5f6fa/python/triton/runtime/jit.py (L639)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135170
Approved by: https://github.com/htyu
2024-09-06 19:05:00 +00:00
13ba0a2e5c Run bypassed graph compile outside the except block to avoid chaining of exceptions (#135175)
Fixes #135172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135175
Approved by: https://github.com/masnesral, https://github.com/ezyang
2024-09-06 19:03:57 +00:00
8520ce5f78 Fix incorrect trace of post-accumulate grad hook on tensor with zero dims (#135226)
Fix incorrect trace of post-accumulate grad hook on tensor with zero dimensions

Fixes #135207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135226
Approved by: https://github.com/xmfan
2024-09-06 18:19:54 +00:00
196748d491 [elastic] support local_addr across all rendezvous impls (#135262)
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that left `local_addr` unused. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.

This also fixes a number of tests, allowing them to run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.

Test Plan:
Added tests for the common rendezvous implementations checking that `local_addr` is respected, to prevent future regressions.

```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```

To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.

Differential Revision: D62256407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-09-06 17:55:43 +00:00
177e4f4218 remove _check call on item() for torch.istft (#135234)
Fixes #135014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135234
Approved by: https://github.com/tugsbayasgalan
2024-09-06 17:31:25 +00:00
3988b3468b [aoti][easy] remove breakpoint() in wrapper.py (#134807)
Differential Revision: D61687146

Remove an unintended breakpoint in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134807
Approved by: https://github.com/YUNQIUGUO
2024-09-06 17:25:05 +00:00
04118d8617 [export] Record the global torch version in serialization. (#135243)
Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version.

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D62252626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243
Approved by: https://github.com/yushangdi
2024-09-06 17:02:06 +00:00
24482e5c68 [torch][fx] Set maximum warning count during fx.Graph.lint (#135069)
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.

These warnings are almost always about get_attr nodes referencing a
non-existent name:
```lang=py
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
  'not reference an nn.Module, nn.Parameter, or buffer, which is '
  'what \'get_attr\' Nodes typically target'
)
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.

Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now

Differential Revision: D62156090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
2024-09-06 16:41:59 +00:00
c0ec599f27 Update submodule ideep to include aarch64 change (#134897)
This PR is per ARM request, which is in https://github.com/intel/ideep/issues/334.

Context for the request: the Arm team has upstreamed the dynamic quantization changes and all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update the feature will not work. The change is isolated to the matmul operator and quantization path alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134897
Approved by: https://github.com/jgong5, https://github.com/atalman, https://github.com/snadampal
2024-09-06 16:40:26 +00:00
7074de43c0 Porting to GCC 15 (#135188)
uint8_t is found in the `cstdint` header

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135188
Approved by: https://github.com/Skylion007
2024-09-06 16:16:53 +00:00
771dcce11d [AOTI][Tooling][6/n] Fix long dtype input tensors calling mean() in aoti_torch_print_tensor_handle (#135072)
Differential Revision: D61635232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135072
Approved by: https://github.com/hl475, https://github.com/ColinPeppler
2024-09-06 15:59:32 +00:00
de74aafff4 error on exporting ScriptModule (#135302)
Test Plan: added test

Differential Revision: D62279179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135302
Approved by: https://github.com/yushangdi
2024-09-06 15:12:40 +00:00
ad29a2c0dc Add Inductor config for default stride behavior (#135238)
By default, Inductor is allowed to manipulate the layout
(strides+storage offset) of input tensors to custom operators.

We want to change it so that the default is that Inductor should respect
the stride order of input tensors to custom operators.

This PR adds a config to toggle the behavior, in the next PR up we'll
change the default. We also make the following changes:
- We add a new operator Tag (flexible_layout), which means that
inductor is allowed to manipulate the layout. When we flip the default,
users can specify they want the old behavior by using this tag.

This is a reland of https://github.com/pytorch/pytorch/pull/126986,
which was previously reverted due to silent incorrectness. We've since
fixed the silent incorrectness
(https://github.com/pytorch/pytorch/pull/133639)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238
Approved by: https://github.com/albanD
2024-09-06 14:48:24 +00:00
3a9e33dca8 [torchelastic] Don't do signal handling when off the main thread (#135088)
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"

To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
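
A minimal sketch (not the actual torchelastic code) of the guard described above:

```python
import signal
import threading

def maybe_register_handler(sig, handler):
    # signal.signal() raises ValueError off the main thread, so only register
    # handlers when we are actually on the main thread.
    if threading.current_thread() is threading.main_thread():
        signal.signal(sig, handler)

maybe_register_handler(signal.SIGTERM, lambda signum, frame: None)
```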

Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8

With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343

Differential Revision: D62166943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
2024-09-06 14:47:03 +00:00
a086882d72 [inductor][triton] mark workspace args as mutated (#134648)
SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such.

Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.

When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-09-06 14:23:37 +00:00
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf).
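
A rough sketch of the narrowed condition (the `input_info` fields are illustrative, not AOTDispatcher's real names):

```python
def should_detach_graph_input(input_info) -> bool:
    # Only detach when the input statically gets no gradient, has no metadata
    # mutations, and is not an autograd leaf (leaves can't be backpropped past).
    return (
        input_info.gets_none_gradient
        and not input_info.mutates_metadata
        and not input_info.is_leaf
    )
```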

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
60a097a071 [CD] Update binary_linux_test.sh to include calling builder smoke test (#133869)
Run smoke test

Fixes #1969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133869
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-09-06 13:27:24 +00:00
13bae39e22 [inductor] [cpp] improve cache blocking for is_dynamic_M (#131306)
## Performance
Models with >= 3% performance speedup are listed below:

### AMP single-thread dynamic shape (measured on CPU with AMX support)
No regressions

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | soft_actor_critic| 3%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131306
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #135275

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-06 13:21:24 +00:00
4ef6c05f65 [inductor][cpp][gemm] fix autotune runtime error from linear_binary fusion (#135275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135275
Approved by: https://github.com/leslie-fang-intel
2024-09-06 13:21:23 +00:00
d6b9bd3e60 Also handle compiler collective when input variable doesn't exist on all ranks (#135147)
Internal xref:
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135147
Approved by: https://github.com/jansel
2024-09-06 13:18:36 +00:00
d0591f4658 Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:

I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
2024-09-06 13:13:15 +00:00
b5dea061c8 check compilation status before query cudnn version in conv (#135332)
This PR fixes https://github.com/pytorch/pytorch/issues/135322. The cuDNN compilation status should be checked before querying the version; otherwise, conv may trigger a RuntimeError before any check in other non-CUDA backends.
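
A Python-level analogue (the PR's actual change is in the C++ conv path) of checking availability before querying the version:

```python
import torch

if torch.backends.cudnn.is_available():
    print("cuDNN version:", torch.backends.cudnn.version())
else:
    print("cuDNN not compiled in; skip version-dependent checks")
```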

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135332
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-06 12:50:04 +00:00
041960a1ce [Dynamo] Automatically in-graph traceable tensor subclass ctors (#135151)
Fixes https://github.com/pytorch/pytorch/issues/114389

Previously, dynamo would attempt to trace through the `__init__` of traceable tensor subclasses. Since their constructors are AOT-dispatcher traceable by definition, dynamo should automatically put these in the graph like we do for any other tensors. Tracing through the constructor instead is difficult because dynamo would need to apply mutations in the graph after the tensor subclass is created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151
Approved by: https://github.com/bdhirsh
2024-09-06 12:23:38 +00:00
67c7924ea1 [inductor] Fix gen_transposed_tile_load_store (#135307)
The recent PR https://github.com/pytorch/pytorch/pull/131745 introduced new VLA logic in the cpp codegen. It raises a build failure on MSVC with error code `Compiler Error C2131`: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2131?view=msvc-170

reproduce UT:
```cmd
pytest test\inductor\test_torchinductor_dynamic_shapes.py -v -k test_large_block_sizes_dynamic_shapes_cpu
```

Original generated code:
```c++
alignas(16) float tmp1[static_cast<int64_t>(((-256LL)*(c10::div_floor_integer(static_cast<int64_t>(ks1), static_cast<int64_t>(16LL)))) + (16LL*ks1))];
```

Changes:
allocate a large-enough fixed-size buffer.

New generated code:
```c++
alignas(16) float tmp1[16*16];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135307
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 10:44:08 +00:00
217ba7b2ab [Docs] Update FileCheck doc (#135199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135199
Approved by: https://github.com/soulitzer
2024-09-06 08:18:38 +00:00
758d515d98 [Inductor][CPP] Select tiling factor for lower precision data types (#133830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 08:12:37 +00:00
60d98b4cfb Update torch-xpu-ops pin (ATen XPU implementation) (#135300)
Release cycle for PyTorch 2.5
1. Bugfixing: correct reduction logic in cdist kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135300
Approved by: https://github.com/EikanWang
2024-09-06 07:30:09 +00:00
590a3e9f8a [export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525)
Summary:
In the graph of the TestXNNPACKQuantizer.test_dynamic_linear_with_con test, some quantized_decomposed.quantize_per_tensor.default ops become quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training IR.

This is because we lift params/buffers before calling make_fx. Previously, for the graph passed to make_fx, `graph.L__self___linear1.weight` was a tensor;
now, in the training IR, `graph.L__self___linear1.weight` is a FakeTensor. This causes the node overload to be different.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```

Differential Revision: D61364547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2024-09-06 07:06:06 +00:00
764ee6e3f9 [FlexAttention] Specify padding_value for boundary checked loads (#134573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134573
Approved by: https://github.com/Chillee
2024-09-06 06:47:26 +00:00
67f98a99a4 [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271
Approved by: https://github.com/fduwjj
2024-09-06 06:23:20 +00:00
e020a8755a [Fix][FR][ez] Remove debugging logs (#135308)
Removing the print added during the debugging process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308
Approved by: https://github.com/wz337
2024-09-06 06:14:33 +00:00
7ffb3b201c [inductor] Remove LoopBody.reads,writes,other (#135256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135256
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079, #135235
2024-09-06 06:11:55 +00:00
f946bf88c4 [inductor] Skip retracing an existing LoopBody (#135235)
This is roughly a 7% speedup in inductor compile time for hf_Bert_large.  The time spent in `LoopBody.__init__` improves from 15% to 8% of `fx_codegen_and_compile`.

Before
![image](https://github.com/user-attachments/assets/7de0f28e-35bd-472f-b4be-b52733d2a85c)

After
![image](https://github.com/user-attachments/assets/5f0cf11a-43c5-43ae-b13c-f32383a75a7f)

Overall
![image](https://github.com/user-attachments/assets/6a369d8c-fb5e-4ad2-9504-0fc745ad6568)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135235
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079
2024-09-06 06:11:55 +00:00
66da3b3b2a [fx] Bypass custom __setattr__ in Node.__init__ (#135079)
Before:
![image](https://github.com/user-attachments/assets/5f0a6ae6-6049-44d0-b5f2-a549a23ad97f)

After:
![image](https://github.com/user-attachments/assets/51c9f91b-f8a0-4043-8362-65813feec823)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135079
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084
2024-09-06 06:11:46 +00:00
41e653456e [RDP] Fix "No module named 'libfb’" (#135244)
Summary:
D62215095 introduced an import error in arvr pipelines because the is_fbcode() function does not work as intended.

This change makes is_fbcode() a much stricter check.

Test Plan:
```
buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH
```

Differential Revision: D62237502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244
Approved by: https://github.com/aorenste
2024-09-06 04:52:31 +00:00
e40a0a9359 Add randomness checking for sdpa vmap (#135176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135176
Approved by: https://github.com/zou3519
2024-09-06 04:50:49 +00:00
c05a7adb36 [inductor][debug] fix draw_buffers (#135266)
**Before:**
![image](https://github.com/user-attachments/assets/aac756f3-1349-4647-9da3-87cf105cf647)

**After:**
<img width="791" alt="image" src="https://github.com/user-attachments/assets/d72c663c-e598-42fa-ac40-9e58956f1ec1">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135266
Approved by: https://github.com/yf225
2024-09-06 04:12:41 +00:00
5f57be7571 [Distributed] Change function call in test to non-deprecated to eliminate warning (#134938)
Migrate function calls in the test to eliminate the warning message below and reduce the chance of test failures when the deprecated methods are removed

-  change the deprecated `save_state_dict` to `save`
-  change the deprecated `load_state_dict` to `load`

Warning message:
```bash
pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938
Approved by: https://github.com/wz337, https://github.com/fegin
2024-09-06 03:25:09 +00:00
29d72c1100 [inductor] check intel compiler minimal version (#135209)
On Windows, early versions of icx have a `-print-file-name` issue and cannot preload correctly for Inductor. Add a minimum version check for the Intel compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135209
Approved by: https://github.com/ezyang
2024-09-06 03:21:07 +00:00
3b1a334c0f [Inductor][CPP] Avoid mistake wgt tensor delete (#135100)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/134998: previously, we only checked whether the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor, and the tensor should not be deleted in such cases. In this PR, we add the number of users of the tensor, along with the number of users of the nodes, to decide whether the tensor can be deleted.
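A rough sketch of the idea (a hypothetical helper, not the actual Inductor code): a `get_attr` node having a single user is not sufficient on its own, so we also count how many `get_attr` nodes point at the same target.

```python
import collections
import torch.fx

def weight_is_safe_to_delete(gm: torch.fx.GraphModule, node: torch.fx.Node) -> bool:
    # A get_attr node with a single user is not enough: another get_attr node
    # may reference the same attribute (i.e. the same underlying tensor), so
    # also count how many get_attr nodes share this target.
    assert node.op == "get_attr"
    target_counts = collections.Counter(
        n.target for n in gm.graph.nodes if n.op == "get_attr"
    )
    return len(node.users) == 1 and target_counts[node.target] == 1
```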

**TestPlan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100
Approved by: https://github.com/jgong5
2024-09-06 03:13:36 +00:00
07689a38bf [Inductor] Fix AOT weight alignment issue on CPU (#135205)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.
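The fix amounts to rounding the constants-blob size up to the alignment boundary before emitting it; a sketch of the arithmetic (names are illustrative, not the actual Inductor variables):

```python
ALIGN_BYTES = 64  # illustrative; the real value comes from the AOTI codegen

def align_up(n: int, align: int = ALIGN_BYTES) -> int:
    # Round n up to the next multiple of `align`, so the space reserved for
    # _binary_constants_bin matches the padded serialized weights.
    return (n + align - 1) // align * align

assert align_up(100, 64) == 128
assert align_up(128, 64) == 128
```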

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-09-06 03:06:51 +00:00
06a7dc21c1 Remove dead expect_rational (#135105)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135105
Approved by: https://github.com/malfet
2024-09-06 02:57:27 +00:00
d9a18173fa Report qualname of exception type rather than <class 'RuntimeError'> (#135146)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135146
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #135148, #135145
2024-09-06 02:56:50 +00:00
d8543e3162 Include exception type qualname when rewrapping InternalTorchDynamoError (#135145)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135145
Approved by: https://github.com/drisspg, https://github.com/anijain2305
ghstack dependencies: #135148
2024-09-06 02:56:50 +00:00
ad01fc194d Consolidate raise and rewrap raise error branches (#135148)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135148
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/yanboliang, https://github.com/malfet
2024-09-06 02:56:46 +00:00
e162414963 add instrumentation of CCA stats for reserved and allocated memory size (#135231)
As titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135231
Approved by: https://github.com/c-p-i-o
2024-09-06 02:48:56 +00:00
9e5a797771 Improve test_public_bindings import module error reporting (#135258)
The error was hard to understand without the message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action.

Example failure:

```
2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)"
2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py)
2024-09-05T20:04:45.3026990Z
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258
Approved by: https://github.com/albanD
2024-09-06 02:40:03 +00:00
b46a1b9e2d Use Python 3.9 on all libtorch jobs (#135245)
Part of the migration py3.8->3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135245
Approved by: https://github.com/izaitsevfb
2024-09-06 02:27:22 +00:00
9688014820 aarch64: extend matmul heuristic checks to all neoverse platforms (#134548)
For aarch64 Neoverse platforms there are two GEMM backends available
for the matmul operator in PyTorch: (1) Arm Compute Library and (2) OpenBLAS.
While Arm Compute Library provides better performance than OpenBLAS,
it has kernel-launch overhead, and hence we use OpenBLAS
for smaller tensor compute. The heuristic was originally implemented for
neoverse_v1; this commit extends the heuristic to other Neoverse platforms.
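Schematically, the heuristic looks like the following (the threshold and shape check are illustrative; the real logic lives in ATen's matmul dispatch):

```python
def pick_gemm_backend(m: int, n: int, k: int, small_gemm_threshold: int = 64) -> str:
    # For small problems the Arm Compute Library kernel-launch overhead dominates,
    # so fall back to OpenBLAS; larger problems amortize that overhead with ACL.
    if max(m, n, k) <= small_gemm_threshold:
        return "openblas"
    return "acl"
```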

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548
Approved by: https://github.com/malfet
2024-09-06 01:40:50 +00:00
8f6e73f068 [ONNX] Enable experimental exporter logic to dynamo_export and support refine dynamic_shapes (#134976)
(1) Enable experimental exporter logic to dynamo_export
(2) Refine dynamic shapes and retry export in export strategies
(3) Delete `torch_export_graph_extractor` and use the new export logic
(4) Disable ExportedProgram test in `test_fx_onnx_with_onnxruntime.py`, as ONNXProgram is different now.

Fixes https://github.com/pytorch/pytorch/issues/126479
Fixes #135183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134976
Approved by: https://github.com/justinchuby
2024-09-06 01:29:56 +00:00
1e57ef08fa [AOTI] Support MKLDNN qconv ops in cpp wrapper (#134795)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qconv in the ABI-compatible mode for cpp-wrapper Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134795
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475, #134783
2024-09-06 01:01:53 +00:00
614b86d602 [AOTI] Support MKLDNN qlinear ops in cpp wrapper (#134783)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qlinear in the ABI-compatible mode for cpp-wrapper Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134783
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475
2024-09-06 01:01:53 +00:00
0b96dfb736 [AOTI] Support MKLDNN conv ops in cpp wrapper (#134475)
Summary: Partially fix https://github.com/pytorch/pytorch/issues/123040. In the ABI-compatible mode, MKLDNN fallback ops do not have C shim implementations and thus need to go through the custom ops launch path. Other MKLDNN ops will be fixed in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134475
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
2024-09-06 01:01:53 +00:00
62b221d5cc Add Percentages to Function Events (#135155)
Summary: Users have recently asked that the profiler add self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurally. Some of it could be done mathematically via subroutines, but since we already have the information in `_build_table`, let's build it there.

Test Plan: Check that we produce the same table as before, and also that the parameters we check have the expected values.

Differential Revision: D62210351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155
Approved by: https://github.com/shanw-meta, https://github.com/kit1980
2024-09-06 00:39:11 +00:00
66dd4577b1 Track base of FunctionalTensor in inference mode. (#135141)
The idea behind the tracking is the following: whenever we see a tensor, if it is a root tensor (it does not have any view metas), we consider it the base of all the tensors that share its storage.
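A pure-Python sketch of that bookkeeping (illustration only, not the FunctionalTensor internals):

```python
import torch

# Map from storage data pointer to the tensor treated as its base.
_base_for_storage: dict = {}

def note_tensor(t: torch.Tensor) -> None:
    key = t.untyped_storage().data_ptr()
    if t._base is None:  # a root tensor: no view metadata
        _base_for_storage[key] = t

def base_of(t: torch.Tensor) -> torch.Tensor:
    return _base_for_storage.get(t.untyped_storage().data_ptr(), t)
```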

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
2024-09-06 00:10:25 +00:00
cyy
cc28634172 [Submodule] Bump pybind11 to v2.13.5 (#135202)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202
Approved by: https://github.com/Skylion007
2024-09-06 00:09:00 +00:00
c83cdf068b [DTensor] Fix view op replicating on tensor dim when the size of the tensor dim = 1 (#135054)
We found a corner case: when a tensor dimension is 1, calling `view(1)` results in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether that dimension is evenly shardable across the mesh dimension, there is no implicit replication behind the scenes as long as the view does not change the size of the given tensor dimension (see cases 2 and 3).

When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518

```
# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)
# this would result in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2)
# this would not result in replication, meaning t stays as sharded.

# even case
p = torch.randn(2, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)
# this would not result in replication, meaning t stays as sharded.
```

Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054
Approved by: https://github.com/tianyu-l, https://github.com/wanchaol
2024-09-06 00:03:54 +00:00
28ccfba248 [ONNX] Delete ONNXProgramSerializer (#135261)
Fixes #135182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135261
Approved by: https://github.com/justinchuby
2024-09-05 23:52:51 +00:00
b2386bdca1 [debug] Add helper to run cProfile on a function (#135084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135084
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082
2024-09-05 23:41:30 +00:00
bdfc8d9f96 [fx] Don't use generators in map_aggregate (#135082)
While the generators avoid a copy, they are slow.

Before:
![image](https://github.com/user-attachments/assets/70a55a9a-0595-4105-b0ab-22cf77c7409c)

After:
![image](https://github.com/user-attachments/assets/cecb9c59-ae36-47de-8b08-cab2c7cb3d57)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135082
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076
2024-09-05 23:41:30 +00:00
70779dded8 [fx] Compile time optimization in Node.__update_args_kwargs (#135076)
Before this we took two passes over all of the args.

Before:
![image](https://github.com/user-attachments/assets/24ce5628-03f4-4983-9f2d-5ddf0ca5816e)

After:
![image](https://github.com/user-attachments/assets/c9681aa2-32f0-4f6b-a598-fc6f90ffafb5)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135076
Approved by: https://github.com/Chillee
ghstack dependencies: #135070
2024-09-05 23:41:30 +00:00
ea231300d1 [inductor] Improve compile time regression from MemoryDep.normalize (#135070)
Possible fix for #135056

Before
![image](https://github.com/user-attachments/assets/3962cb85-e808-4fd4-991f-471ff5ef7eae)

After
![image](https://github.com/user-attachments/assets/2322d48d-6518-4518-baca-336027b5cda8)

Measured based on:
```
python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --training --only hf_Bert_large --stats -n1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135070
Approved by: https://github.com/Chillee
2024-09-05 23:41:30 +00:00
8f66995459 Revert "Support rolling over a percentage of workflows (#134816)"
This reverts commit fc890b55b51098437b6149abf1026a8b2aaee389.

Reverted https://github.com/pytorch/pytorch/pull/134816 on behalf of https://github.com/malfet due to Causes lint to intermittently fail ([comment](https://github.com/pytorch/pytorch/pull/134816#issuecomment-2332902609))
2024-09-05 23:39:41 +00:00
144fde4fd2 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Need to run inductor/test_cpu_select_algorithm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Roy Hvaara <roy@lightyear.no>
2024-09-05 23:23:17 +00:00
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.
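Roughly, the desugaring looks like the following (a simplified sketch, not the actual implementation):

```python
import torch

def tolist_via_item(t: torch.Tensor):
    # Recursively replace tolist() with item() calls, so every scalar read goes
    # through the machinery that creates (and tracks) unbacked symints.
    if t.dim() == 0:
        return t.item()
    return [tolist_via_item(t[i]) for i in range(t.shape[0])]

assert tolist_via_item(torch.tensor([[1, 2], [3, 4]])) == [[1, 2], [3, 4]]
```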

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
65e1c34061 [rfc] scuba for flight recorder (#134794)
Summary: Record flight recorder status in a scuba table.

Test Plan: Testing with timing out a job. Will post results soon.

Differential Revision: D61729221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134794
Approved by: https://github.com/fduwjj
2024-09-05 23:18:10 +00:00
830247c355 [Intel Triton] Update Intel Triton to release/2.5.0 (#134074)
This PR relands https://github.com/pytorch/pytorch/pull/134053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134074
Approved by: https://github.com/EikanWang
2024-09-05 22:46:31 +00:00
4262755b5a [cond] fix typo in cond codegen (#134708)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134708
Approved by: https://github.com/jansel
2024-09-05 22:38:24 +00:00
3825607144 Add torch._logging.scribe (#135224)
See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta only, see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context

fbscribelogger is a library that allows us to write to scribe, which is Meta's logging infrastructure, when you have an appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real-time in-memory database) and Hive (a more traditional SQL-persisted database).

Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested to fbcode, and then wait for us to actually roll out this version, before you get any data. What if you want the results within the next few hours? Instead, you can use torch._logging.scribe to write directly to our logging infrastructure *from inside CI jobs.* The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding the logging code, you can push your PR to CI, add the 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l If you want continuous logging on all commits on master, you can land your PR and it will continuously be logged for all CI runs that happen on main.
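As an illustration, a call might look roughly like this (the exact `open_source_signpost` signature is an assumption here; check `torch/_logging/scribe.py` for the real one):

```python
import json
import torch._logging.scribe as scribe

# Hypothetical call shape: a subsystem, an event name, and a JSON payload.
scribe.open_source_signpost(
    subsystem="dynamo",
    name="my_experiment",
    parameters=json.dumps({"graph_breaks": 3, "commit": "abcdef"}),
)
```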

Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without access to Meta's internal users. But this facility is really good for prototyping / one-off experiments. It's entirely self serve: just add your logging, run your PR CI with ci-scribe, get results, do analysis in Scuba.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224
Approved by: https://github.com/Skylion007
2024-09-05 22:37:13 +00:00
eqy
3c8f71ff93 [cuDNN][64-bit indexing] cuDNN v9.3+ supports non-batch-splittable convolutions with > 2**31 elements (#134890)
For longstanding issues such as #95024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134890
Approved by: https://github.com/Skylion007
2024-09-05 22:22:45 +00:00
fc890b55b5 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-05 22:21:45 +00:00
058a69d91a [fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928)
As Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928
Approved by: https://github.com/ezyang
2024-09-05 22:05:54 +00:00
6c5920d515 Tune int8 AMX WoQ micro-kernel for CPU (#134832)
This patch prevents a performance regression against the default ATen implementation for the LLaMA 3.1 int8 GPTQ WoQ workload.

Use the AMX micro-kernel only if `M` >= `block_m`.
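The dispatch condition amounts to the following sketch (`block_m` being the micro-kernel's M blocking factor):

```python
def use_amx_woq_microkernel(m: int, block_m: int) -> bool:
    # Only use the AMX int8 WoQ micro-kernel when there are at least block_m
    # rows to fill a tile; otherwise the default ATen path wins.
    return m >= block_m
```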

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134832
Approved by: https://github.com/jgong5
2024-09-05 22:01:14 +00:00
116fd474da [export] Expand coverage to more copied sym ops for unflattener. (#135119)
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//torchrec/ir/tests:test_serializer -- --run-disabled

```
File changed: fbcode//caffe2/torch/export/unflatten.py
Buck UI: https://www.internalfb.com/buck2/2e0377e7-e2b6-4bd0-8133-a787245165a0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549824883887
Network: Up: 0B  Down: 0B
Jobs completed: 16. Time elapsed: 10.2s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62190172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135119
Approved by: https://github.com/yushangdi
2024-09-05 21:58:20 +00:00
a5d70cf545 [PyTorch] Add isfinite to BFloat16-math.h (#135052)
Missing function from <cmath>.

Differential Revision: [D62148884](https://our.internmc.facebook.com/intern/diff/D62148884/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135052
Approved by: https://github.com/PaliC, https://github.com/albanD
ghstack dependencies: #135031
2024-09-05 21:50:36 +00:00
7fe819d917 [PyTorch] Fix -Wshadow -Werror build in BFloat16-inl.h (#135031)
`float_t` is required to exist in C99 math.h, which causes -Wshadow to fire. We don't need the alias, fortunately.

Differential Revision: [D62135908](https://our.internmc.facebook.com/intern/diff/D62135908/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135031
Approved by: https://github.com/albanD
2024-09-05 21:48:21 +00:00
f63571060c Revert "Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)"
This reverts commit 9c0b03020b7204ca5d5dbe18174bab005f79c47b.

Reverted https://github.com/pytorch/pytorch/pull/135264 on behalf of https://github.com/atalman due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/135264#issuecomment-2332674607))
2024-09-05 21:43:05 +00:00
38fead8f7c [hop] preserve metadata in re-tracing hop subgraph by running with interpreter (#135159)
This way, interpreter.run can correctly preserve the current metadata of the subgraphs when tracing them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135159
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:36:56 +00:00
24a223c49d Run inductor micro benchmark on x86 metal runner (#135042)
This enables inductor micro benchmark on CPU (x86):

* Running on an AWS metal runner for more accurate benchmarks
* I added a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU.  We can use this later to differentiate between different setups, i.e. cuda (a100) vs cuda (a10g), or cpu (x86_64) vs cpu (arm64)

The next step would be to run this on cpu (arm64) and cuda (a10g).

### Testing
Here is the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
2024-09-05 21:31:36 +00:00
e4920a1364 [Traceable FSDP2][Dynamo] allow tracing through auto_functionalized HOP (#135169)
If an `auto_functionalized` HOP is included in backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing will need to trace through the `auto_functionalized` HOP. This PR adds support for it.

Test commands:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169
Approved by: https://github.com/zou3519
2024-09-05 21:22:45 +00:00
bc5ecf83d7 [training ir migration] Fix quantization tests (#135184)
Summary:
Fixed some quantization tests for the new training IR:

Fix the batch norm node pattern matcher. In the training IR, we have an `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e
```

Reviewed By: tugsbayasgalan

Differential Revision: D62209819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:19:28 +00:00
e55c0f59e5 Revert "[Reland] Refactor caching device allocator utils (#130923)"
This reverts commit 9809080b9ed657a8c0ea0383be7cbdce3a26e05e.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961))
2024-09-05 21:16:14 +00:00
a4cf9653ee Revert "Remove Caffe2 code from tool scripts (#134941)"
This reverts commit c818ecd1698a28d9fadf4a81453a89914b18374a.

Reverted https://github.com/pytorch/pytorch/pull/134941 on behalf of https://github.com/kit1980 due to breaking internal builds - The path `caffe2/operators/hip/gather_op.cuh` does not exist ([comment](https://github.com/pytorch/pytorch/pull/134941#issuecomment-2332636624))
2024-09-05 21:12:54 +00:00
9c0b03020b Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)
To be consistent with https://github.com/pytorch/pytorch/pull/135263 and rest of workflows. Use v4.4.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135264
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-05 21:05:06 +00:00
034717a029 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-09-05 20:36:45 +00:00
9c38b00999 [export] Add ability to run eagerly on UnflattenedModule (#133996)
Summary:
Added the context manager `_disable_interpreter`, which is meant to be put around a call to `unflatten`. This will generate an UnflattenedModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module, instead of a context manager around running the module, because it's not clear where the unflattened module will be called.
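A rough usage sketch (the import location of `_disable_interpreter` is an assumption here; as described above, it is a context manager placed around the `unflatten` call):

```python
import torch
from torch.export import export, unflatten
# Assumed import location for the context manager added in this PR:
from torch.export.unflatten import _disable_interpreter

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = export(M(), (torch.randn(2, 3),))
with _disable_interpreter():
    um = unflatten(ep)  # sub-InterpreterModules here will run eagerly, not via fx.Interpreter
print(um(torch.randn(2, 3)))
```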

This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030

Test Plan: CI

Differential Revision: D60939034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
2024-09-05 20:28:42 +00:00
8efe547046 Use actions/upload-artifact@v4.4.0 for triton builds (#135263)
Same as: https://github.com/pytorch/pytorch/pull/135139
Fixes upload failure: https://github.com/pytorch/pytorch/actions/runs/10722567217/job/29748125015
fix regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135263
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-09-05 20:03:39 +00:00
82d00acfee Allow cross-device copies for cpu scalars in refs (#135140)
This copies our eager-mode behavior where someone can do torch.add(a, b, out=c)
where a and b are CPU scalar tensors and c is a CUDA tensor.
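Concretely, the eager behavior being copied is (a minimal sketch; requires a CUDA device):

```python
import torch

if torch.cuda.is_available():
    a = torch.tensor(1.0)              # CPU scalar tensor
    b = torch.tensor(2.0)              # CPU scalar tensor
    c = torch.empty((), device="cuda")
    torch.add(a, b, out=c)             # eager allows writing the CPU scalar result into a CUDA out
    print(c)                           # tensor(3., device='cuda:0')
```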

Fixes https://github.com/pytorch/pytorch/issues/121619 by side effect (we get into a situation where we're writing a CPU scalar into a FakeTensor that is actually a meta tensor)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135140
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
2024-09-05 19:08:48 +00:00
098431a29d Update Resize.cpp with new device type (#135117)
Update Resize.cpp with new device type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135117
Approved by: https://github.com/egienvalue
2024-09-05 18:53:13 +00:00
be660ea2d3 [PT2] Directly set meta.val in group_batch_fusion_aten (#135078)
Summary: instead of using FakeTensorProp after the pass

Differential Revision: D62162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135078
Approved by: https://github.com/frank-wei
2024-09-05 18:17:06 +00:00
52c7c89ea4 [Inductor][CPP] Leverage full bits for BF16/FP16 vectorization (#126502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126502
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-05 17:17:46 +00:00
1efd341d15 [fake_tensor] Move unrecognized_type NotImplemented before ConstProp (#135033)
We should not try to do ConstProp on unrecognized types (e.g. subclasses).
For those types, throwing NotImplemented will jump to the next torch_dispatch.
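For context, this relies on the standard `__torch_dispatch__` protocol: declining an op with `NotImplemented` lets the next handler run. A schematic sketch:

```python
import torch

class Declining(torch.Tensor):
    # Schematic subclass: its dispatch handler declines every op.
    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Returning NotImplemented tells the dispatcher to move on to the next
        # __torch_dispatch__ in line rather than erroring (or const-propping) here.
        return NotImplemented
```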

Test:
```
 python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-09-05 17:09:41 +00:00
a096f2899d Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`: if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized" (space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor).

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-09-05 16:53:39 +00:00
dbeb8a1691 Render log filepaths that are not anchored in torch's directory in a reasonable way (#135165)
For example, if I do TORCH_LOGS=fbscribelogger I'll get:

```
I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop
```

instead of

```
I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165
Approved by: https://github.com/Skylion007
2024-09-05 16:48:09 +00:00
b1f72e2984 Gradient scaler for DTensor (#132816)
Solve the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798).
Enable DTensor input in the gradient scaler's APIs, especially `.unscale_()`.
A related dispatch strategy is added to accept DTensor input.

To enable found_inf to perform the reduce across devices, we add an allreduce at dispatch, with the args, after the dispatch strategy and kernel.
Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale (arg[0]) will be modified in place, so redesigning a strategy or refactoring the kernel would not help.

The test files cover the following under 1-D (DP) and 2-D (DP, TP) cases (see the usage sketch after this list):
1. whether the non-inf values are unscaled
2. whether all DTensors on each device can find inf, even when it is not on their device
3. whether new parameters are generated if inf is not found
4. whether the scale is updated if inf is found
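A minimal sketch of that flow (a plain CUDA model stands in here; per this PR, parameters and gradients that are DTensors go through the same calls):

```python
import torch

model = torch.nn.Linear(8, 4).cuda()   # with this PR, DTensor parameters work the same way
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("cuda")

x, y = torch.randn(16, 8, device="cuda"), torch.randn(16, 4, device="cuda")
optimizer.zero_grad()
with torch.autocast("cuda"):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # the call this PR teaches to accept DTensor grads (found_inf is all-reduced)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
```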

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol
2024-09-05 16:44:32 +00:00
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` makes sure the class has the attr device_type, which works with `skipCUDAIf` from `torch.testing._internal.common_device_type`.

Also skip test_vertical_pointwise_reduction_fusion for the CPU test class, since the test expects CUDA.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```
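For reference, the pattern being moved to looks roughly like this (a schematic sketch using the common_device_type helpers named above):

```python
import torch
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    skipCUDAIf,
)
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExample(TestCase):
    @skipCUDAIf(not torch.cuda.is_available(), "requires CUDA")
    def test_something(self, device):
        # `device` is injected by instantiate_device_type_tests, so skipCUDAIf
        # knows which device variant of the test is running.
        torch.ones(2, device=device)

instantiate_device_type_tests(TestExample, globals())

if __name__ == "__main__":
    run_tests()
```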

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
2c99f17a32 Implement VariableTracker.python_type() (#134215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134215
Approved by: https://github.com/amjames, https://github.com/jansel
2024-09-05 16:35:47 +00:00
0043dcd79e Switch torch pt2e xnnpack tests to use export_for_training (#134788)
Migrate all the callsites inside the pt2e XNNPACK tests to use export_for_training.

Differential Revision: D61994553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134788
Approved by: https://github.com/mergennachin
2024-09-05 16:11:18 +00:00
2e2fb668fa Upgrade expecttest to 0.2.1 (#135136)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135136
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/Skylion007
2024-09-05 16:05:35 +00:00
9d24f945ba [CI] Use larger instance for building triton whl (#135201)
CI jobs of "Build Triton Wheels" were failing due to a lack of resources. This PR uses a larger runner to avoid these issues.

The failure message is like:

```
Process completed with exit code 137.
```

Related running actions:
Failed actions: https://github.com/pytorch/pytorch/actions/runs/10714445036
Success actions: https://github.com/pytorch/pytorch/actions/runs/10716710830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135201
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2024-09-05 14:36:23 +00:00
ecbd715363 [Intel GPU][Windows] Fix overriding default CMAKE_CXX_FLAGS (#135093)
The root cause is that `/EHsc` is part of the default `CMAKE_CXX_FLAGS` in CMake.
The fix is to not override the default `CMAKE_CXX_FLAGS`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135093
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-05 12:52:43 +00:00
58f2477a26 [Dynamo] Support builtin function frozenset (#134563)
Support builtin function frozenset in dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134563
Approved by: https://github.com/anijain2305, https://github.com/EikanWang, https://github.com/jansel
2024-09-05 12:15:10 +00:00
43dcb4bb61 Revise CPU vectorization ISA support API (#135075)
Revising (mostly renaming) CPU vectorization ISA support API (non-frontend-user-facing). Also added AVX512_BF16 ISA detection API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang
2024-09-05 12:14:56 +00:00
50d1e37079 [AOTI] Fix a unbacked symint retrieve bug (#134670)
Summary: Fix https://github.com/pytorch/pytorch/issues/134081. When an unbacked symint is computed as the shape of a tensor from a tuple, the generated C++ code needs to use std::get<> to extract the tensor.

Differential Revision: [D62142113](https://our.internmc.facebook.com/intern/diff/D62142113)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134670
Approved by: https://github.com/angelayi, https://github.com/22quinn, https://github.com/chenyang78
2024-09-05 11:34:14 +00:00
b99ef1a02e Update torch-xpu-ops pin (ATen XPU implementation) (#135185)
Release cycle for PyTorch 2.5
1. Update specific AOT targets for Windows. On Windows, AOT target list prefers Intel client GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185
Approved by: https://github.com/EikanWang
2024-09-05 10:05:23 +00:00
8a5c8e5db9 Update unbacked symints in masked_select more precisely (#134899)
## Summary
At the moment, the fake impl for `masked_select` simply sets the upper range of its size-like SymInt to `sys.maxsize` (9223372036854775807, the max value of a signed int64) if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.

This solves an issue where a model being lowered to ExecuTorch errors during memory planning because the memory allocated for `masked_select` ends up exceeding the 64-bit address space (`INT_MAX * size(dtype)`).
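Schematically, the tightened bound is just the product of the input dimensions' upper bounds instead of `sys.maxsize` (an illustrative helper, not the actual fake-impl code):

```python
import sys
from functools import reduce
from operator import mul

def masked_select_numel_upper_bound(dim_upper_bounds):
    # masked_select can return at most numel(input) elements, so bound the
    # size-like unbacked SymInt by the product of each dimension's upper bound
    # instead of sys.maxsize.
    if any(b is None for b in dim_upper_bounds):  # unknown bound -> fall back
        return sys.maxsize
    return reduce(mul, dim_upper_bounds, 1)

assert masked_select_numel_upper_bound([8, 128]) == 1024
```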

## Test plan
- Passes existing unit tests (tests case where upper bound is inf)
- Added unit test to verify upper bound reduction calculation
- Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899
Approved by: https://github.com/ezyang
2024-09-05 09:01:06 +00:00
c7328dff7f Enhance the stability of the complex divide code (#134647)
In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double.
```c++
float f = 3.14f;
if (f == 3.14) {
    // Do something
}
```
If a device does not support double, an error will occur.
This PR addresses the issue of complex64 errors on machines that do not support double operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-09-05 08:36:37 +00:00
7213 changed files with 537690 additions and 146229 deletions

View File

@@ -1 +1 @@
6.1.1
6.5.0

View File

@@ -1,26 +0,0 @@
[pt]
is_oss=1
[buildfile]
name = BUCK.oss
includes = //tools/build_defs/select.bzl
[repositories]
bazel_skylib = third_party/bazel-skylib/
ovr_config = .
[download]
in_build = true
[cxx]
cxxflags = -std=c++17
ldflags = -Wl,--no-undefined
should_remap_host_platform = true
cpp = /usr/bin/clang
cc = /usr/bin/clang
cxx = /usr/bin/clang++
cxxpp = /usr/bin/clang++
ld = /usr/bin/clang++
[project]
default_flavors_mode=all

View File

@@ -0,0 +1,19 @@
# Aarch64 (ARM/Graviton) Support Scripts
Scripts for building aarch64 PyTorch PIP Wheels. These scripts build the following wheels:
* torch
* torchvision
* torchaudio
* torchtext
* torchdata
## Aarch64_ci_build.sh
This script is design to support CD operations within PyPi manylinux aarch64 container, and be executed in the container. It prepares the container and then executes __aarch64_wheel_ci_build.py__ to build the wheels. The script "assumes" the PyTorch repo is located at: ```/pytorch``` and will put the wheels into ```/artifacts```.
### Usage
```DESIRED_PYTHON=<PythonVersion> aarch64_ci_build.sh```
__NOTE:__ CI build is currently __EXPERMINTAL__
## Build_aarch64_wheel.py
This app allows a person to build using AWS EC3 resources and requires AWS-CLI and Boto3 with AWS credentials to support building EC2 instances for the wheel builds. Can be used in a codebuild CD or from a local system.
### Usage
```build_aarch64_wheel.py --key-name <YourPemKey> --use-docker --python 3.8 --branch <RCtag>```

View File

@@ -0,0 +1,26 @@
#!/bin/bash
set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
source $SCRIPTPATH/aarch64_ci_setup.sh
###############################################################################
# Run aarch64 builder python
###############################################################################
cd /
# adding safe directory for git as the permissions will be
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

View File

@@ -0,0 +1,23 @@
#!/bin/bash
set -eux -o pipefail
# This script is used to prepare the Docker container for aarch64_ci_wheel_build.py python script
# By creating symlinks from desired /opt/python to /usr/local/bin/
NUMPY_VERSION=2.0.2
PYGIT2_VERSION=1.15.1
if [[ "$DESIRED_PYTHON" == "3.13" ]]; then
NUMPY_VERSION=2.1.2
PYGIT2_VERSION=1.16.0
fi
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
source $SCRIPTPATH/../manywheel/set_desired_python.sh
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2 pygit2==${PYGIT2_VERSION}
for tool in python python3 pip pip3 ninja scons patchelf; do
ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;
done
python --version

View File

@@ -0,0 +1,230 @@
#!/usr/bin/env python3
# encoding: UTF-8
import os
import shutil
from subprocess import check_call, check_output
from typing import List
from pygit2 import Repository
def list_dir(path: str) -> List[str]:
"""'
Helper for getting paths for Python
"""
return check_output(["ls", "-1", path]).decode().split("\n")
def build_ArmComputeLibrary() -> None:
"""
Using ArmComputeLibrary for aarch64 PyTorch
"""
print("Building Arm Compute Library")
acl_build_flags = [
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = "ComputeLibrary"
os.makedirs(acl_install_dir)
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v24.09",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", "-j8", f"build_dir=/{acl_install_dir}/build"]
+ acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path) -> None:
"""
Update the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
os.mkdir(f"{folder}/tmp")
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusparse.so.12",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
]
if enable_cuda:
libs_to_copy += [
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
]
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
f"{folder}/cuda_wheel/{wheelname}",
f"{folder}/{wheelname}",
copy_function=shutil.copy2,
)
os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")
def complete_wheel(folder: str) -> str:
"""
Complete wheel build and put in artifact location
"""
wheel_name = list_dir(f"/{folder}/dist")[0]
if "pytorch" in folder and not enable_cuda:
print("Repairing Wheel with AuditWheel")
check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
repaired_wheel_name = list_dir(f"/{folder}/wheelhouse")[0]
print(f"Moving {repaired_wheel_name} wheel to /{folder}/dist")
os.rename(
f"/{folder}/wheelhouse/{repaired_wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
f"/{folder}/dist/{repaired_wheel_name}", f"/artifacts/{repaired_wheel_name}"
)
return repaired_wheel_name
def parse_arguments():
"""
Parse inline arguments
"""
from argparse import ArgumentParser
parser = ArgumentParser("AARCH64 wheels python CD")
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")
parser.add_argument("--test-only", type=str)
parser.add_argument("--enable-mkldnn", action="store_true")
parser.add_argument("--enable-cuda", action="store_true")
return parser.parse_args()
if __name__ == "__main__":
"""
Entry Point
"""
args = parse_arguments()
enable_mkldnn = args.enable_mkldnn
enable_cuda = args.enable_cuda
repo = Repository("/pytorch")
branch = repo.head.name
if branch == "HEAD":
branch = "master"
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
if override_package_version is not None:
version = override_package_version
build_vars += (
f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version} PYTORCH_BUILD_NUMBER=1 "
)
elif branch in ["nightly", "master"]:
build_date = (
check_output(["git", "log", "--pretty=format:%cs", "-1"], cwd="/pytorch")
.decode()
.replace("-", "")
)
version = (
check_output(["cat", "version.txt"], cwd="/pytorch").decode().strip()[:-2]
)
if enable_cuda:
desired_cuda = os.getenv("DESIRED_CUDA")
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date}+{desired_cuda} PYTORCH_BUILD_NUMBER=1 "
else:
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "
elif branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()
print("build pytorch with mkldnn+acl backend")
build_vars += (
"USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
"ACL_ROOT_DIR=/acl "
"LD_LIBRARY_PATH=/pytorch/build/lib:/acl/build:$LD_LIBRARY_PATH "
"ACL_INCLUDE_DIR=/acl/build "
"ACL_LIBRARY=/acl/build "
)
if enable_cuda:
build_vars += "BLAS=NVPL "
else:
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/OpenBLAS "
else:
print("build pytorch without mkldnn backend")
os.system(f"cd /pytorch; {build_vars} python3 setup.py bdist_wheel")
if enable_cuda:
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,87 @@
#!/usr/bin/env python3
import os
import shutil
import sys
from subprocess import check_call
from tempfile import TemporaryDirectory
from auditwheel.elfutils import elf_file_filter
from auditwheel.lddtree import lddtree
from auditwheel.patcher import Patchelf
from auditwheel.repair import copylib
from auditwheel.wheeltools import InWheelCtx
def replace_tag(filename):
with open(filename) as f:
lines = f.read().split("\\n")
for i, line in enumerate(lines):
if not line.startswith("Tag: "):
continue
lines[i] = line.replace("-linux_", "-manylinux2014_")
print(f"Updated tag from {line} to {lines[i]}")
with open(filename, "w") as f:
f.write("\\n".join(lines))
class AlignedPatchelf(Patchelf):
def set_soname(self, file_name: str, new_soname: str) -> None:
check_call(
["patchelf", "--page-size", "65536", "--set-soname", new_soname, file_name]
)
def replace_needed(self, file_name: str, soname: str, new_soname: str) -> None:
check_call(
[
"patchelf",
"--page-size",
"65536",
"--replace-needed",
soname,
new_soname,
file_name,
]
)
def embed_library(whl_path, lib_soname, update_tag=False):
patcher = AlignedPatchelf()
out_dir = TemporaryDirectory()
whl_name = os.path.basename(whl_path)
tmp_whl_name = os.path.join(out_dir.name, whl_name)
with InWheelCtx(whl_path) as ctx:
torchlib_path = os.path.join(ctx._tmpdir.name, "torch", "lib")
ctx.out_wheel = tmp_whl_name
new_lib_path, new_lib_soname = None, None
for filename, _ in elf_file_filter(ctx.iter_files()):
if not filename.startswith("torch/lib"):
continue
libtree = lddtree(filename)
if lib_soname not in libtree["needed"]:
continue
lib_path = libtree["libs"][lib_soname]["path"]
if lib_path is None:
print(f"Can't embed {lib_soname} as it could not be found")
break
if lib_path.startswith(torchlib_path):
continue
if new_lib_path is None:
new_lib_soname, new_lib_path = copylib(lib_path, torchlib_path, patcher)
patcher.replace_needed(filename, lib_soname, new_lib_soname)
print(f"Replacing {lib_soname} with {new_lib_soname} for {filename}")
if update_tag:
# Add manylinux2014 tag
for filename in ctx.iter_files():
if os.path.basename(filename) != "WHEEL":
continue
replace_tag(filename)
shutil.move(tmp_whl_name, whl_path)
if __name__ == "__main__":
embed_library(
sys.argv[1], "libgomp.so.1", len(sys.argv) > 2 and sys.argv[2] == "--update-tag"
)
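A minimal usage sketch for the script above, assuming it is saved as embed_library.py (the wheel file name is illustrative); it copies libgomp.so.1 into torch/lib, rewrites the DT_NEEDED entries that reference it, and with the optional flag retags the wheel's WHEEL metadata as manylinux2014:

python3 embed_library.py /pytorch/dist/torch-2.7.0.dev20250109-cp310-cp310-linux_aarch64.whl --update-tag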

View File

@ -1,47 +1,39 @@
ARG CUDA_VERSION=10.2
ARG CUDA_VERSION=12.4
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM centos:7 as base
FROM amd64/almalinux:8 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum update -y
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which unzip
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum -y update
RUN yum -y install epel-release
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
# EPEL for cmake
RUN yum --enablerepo=extras install -y epel-release
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum install -y autoconf aclocal automake make sudo
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
RUN rm -rf /usr/local/cuda-*
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh && cp $(which patchelf) /patchelf
FROM base as openssl
# Install openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
FROM base as conda
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
@ -49,7 +41,7 @@ RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=10.2
ARG CUDA_VERSION=12.4
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
@ -70,6 +62,10 @@ FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
@ -79,6 +75,7 @@ FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
# Final step
FROM ${BASE_TARGET} as final
@ -91,7 +88,8 @@ COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
ENV PATH /opt/conda/bin:$PATH
ENV PATH /opt/conda/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
COPY --from=mnist /usr/local/mnist /usr/local/mnist
RUN rm -rf /usr/local/cuda
RUN chmod o+rw /usr/local

View File

@ -37,15 +37,21 @@ esac
(
set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=9" \
--build-arg "DEVTOOLSET_VERSION=11" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/conda/Dockerfile" \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
)

View File

@ -1 +0,0 @@
<manifest package="org.pytorch.deps" />

View File

@ -1,66 +0,0 @@
buildscript {
ext {
minSdkVersion = 21
targetSdkVersion = 28
compileSdkVersion = 28
buildToolsVersion = '28.0.3'
coreVersion = "1.2.0"
extJUnitVersion = "1.1.1"
runnerVersion = "1.2.0"
rulesVersion = "1.2.0"
junitVersion = "4.12"
}
repositories {
google()
mavenLocal()
mavenCentral()
jcenter()
}
dependencies {
classpath 'com.android.tools.build:gradle:4.1.2'
classpath 'com.vanniktech:gradle-maven-publish-plugin:0.14.2'
}
}
repositories {
google()
jcenter()
}
apply plugin: 'com.android.library'
android {
compileSdkVersion rootProject.compileSdkVersion
buildToolsVersion rootProject.buildToolsVersion
defaultConfig {
minSdkVersion minSdkVersion
targetSdkVersion targetSdkVersion
}
sourceSets {
main {
manifest.srcFile 'AndroidManifest.xml'
}
}
}
dependencies {
implementation 'com.android.support:appcompat-v7:28.0.0'
implementation 'androidx.appcompat:appcompat:1.0.0'
implementation 'com.facebook.fbjni:fbjni-java-only:0.2.2'
implementation 'com.google.code.findbugs:jsr305:3.0.1'
implementation 'com.facebook.soloader:nativeloader:0.10.5'
implementation 'junit:junit:' + rootProject.junitVersion
implementation 'androidx.test:core:' + rootProject.coreVersion
implementation 'junit:junit:' + rootProject.junitVersion
implementation 'androidx.test:core:' + rootProject.coreVersion
implementation 'androidx.test.ext:junit:' + rootProject.extJUnitVersion
implementation 'androidx.test:rules:' + rootProject.rulesVersion
implementation 'androidx.test:runner:' + rootProject.runnerVersion
}

View File

@ -1,5 +1,5 @@
0.6b
manylinux_2_17
0.8b
manylinux_2_28
rocm6.2
7f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
e4ab195d2bd19e939c675a13280c29714c6ef9f2cf420690da150fa0cac043b1
6f8cbcac8a92775291bb1ba8f514d4beb350baf4
e938def5d32869fe2e00aec0300f354c9f157867bebdf2e104d732b94cb238d8

View File

@ -179,10 +179,10 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -192,9 +192,10 @@ case "$image" in
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -221,20 +222,6 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
@ -244,16 +231,6 @@ case "$image" in
CONDA_CMAKE=yes
ONNX=yes
;;
pytorch-linux-focal-py3-clang9-android-ndk-r21e)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=9
LLVMDEV=yes
PROTOBUF=yes
ANDROID=yes
ANDROID_NDK_VERSION=r21e
GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0
;;
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
@ -286,18 +263,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -307,6 +273,17 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.2.4
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -318,6 +295,17 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
XPU_VERSION=2025.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -355,6 +343,12 @@ case "$image" in
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -379,6 +373,14 @@ case "$image" in
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
@ -400,9 +402,6 @@ case "$image" in
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -415,9 +414,6 @@ case "$image" in
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -494,8 +490,6 @@ docker build \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "ANDROID=${ANDROID}" \
--build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \
--build-arg "SWIFTSHADER=${SWIFTSHADER}" \
@ -503,12 +497,13 @@ docker build \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx906;gfx90a}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "TRITON_CPU=${TRITON_CPU}" \
--build-arg "ONNX=${ONNX}" \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

View File

@ -108,10 +108,10 @@ ENV CMAKE_C_COMPILER cc
ENV CMAKE_CXX_COMPILER c++
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt

View File

@ -1 +1 @@
cd1c833b079adb324871dcbbe75b43d42ffc0ade
a29b208a06ab378bb29ab1aa68932e412f8e09f1

View File

@ -0,0 +1 @@
c7711371cace304afe265c1ffa906415ab82fc66

View File

@ -1 +0,0 @@
21eae954efa5bf584da70324b640288c3ee7aede

View File

@ -1 +1 @@
1b2f15840e0d70eec50d84c7a0575cb835524def
e98b6fcb8df5b44eb0d0addb6767c573d37ba024

View File

@ -1 +1 @@
dedb7bdf339a3546896d4820366ca562c586bfa0
0d4682f073ded4d1a8260dd4208a43d735ae3a2b

View File

@ -1,112 +0,0 @@
#!/bin/bash
set -ex
[ -n "${ANDROID_NDK}" ]
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
apt-get update
apt-get install -y --no-install-recommends autotools-dev autoconf unzip
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
pushd /tmp
curl -Os --retry 3 $_https_amazon_aws/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
popd
_ndk_dir=/opt/ndk
mkdir -p "$_ndk_dir"
unzip -qo /tmp/android*.zip -d "$_ndk_dir"
_versioned_dir=$(find "$_ndk_dir/" -mindepth 1 -maxdepth 1 -type d)
mv "$_versioned_dir"/* "$_ndk_dir"/
rmdir "$_versioned_dir"
rm -rf /tmp/*
# Install OpenJDK
# https://hub.docker.com/r/picoded/ubuntu-openjdk-8-jdk/dockerfile/
sudo apt-get update && \
apt-get install -y openjdk-8-jdk && \
apt-get install -y ant && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer;
# Fix certificate issues, found as of
# https://bugs.launchpad.net/ubuntu/+source/ca-certificates-java/+bug/983302
sudo apt-get update && \
apt-get install -y ca-certificates-java && \
apt-get clean && \
update-ca-certificates -f && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer;
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
# Installing android sdk
# https://github.com/circleci/circleci-images/blob/staging/android/Dockerfile.m4
_tmp_sdk_zip=/tmp/android-sdk-linux.zip
_android_home=/opt/android/sdk
rm -rf $_android_home
sudo mkdir -p $_android_home
curl --silent --show-error --location --fail --retry 3 --output /tmp/android-sdk-linux.zip $_https_amazon_aws/android-sdk-linux-tools3859397-build-tools2803-2902-platforms28-29.zip
sudo unzip -q $_tmp_sdk_zip -d $_android_home
rm $_tmp_sdk_zip
sudo chmod -R 777 $_android_home
export ANDROID_HOME=$_android_home
export ADB_INSTALL_TIMEOUT=120
export PATH="${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools:${PATH}"
echo "PATH:${PATH}"
# Installing Gradle
echo "GRADLE_VERSION:${GRADLE_VERSION}"
_gradle_home=/opt/gradle
sudo rm -rf $gradle_home
sudo mkdir -p $_gradle_home
curl --silent --output /tmp/gradle.zip --retry 3 $_https_amazon_aws/gradle-${GRADLE_VERSION}-bin.zip
sudo unzip -q /tmp/gradle.zip -d $_gradle_home
rm /tmp/gradle.zip
sudo chmod -R 777 $_gradle_home
export GRADLE_HOME=$_gradle_home/gradle-$GRADLE_VERSION
alias gradle="${GRADLE_HOME}/bin/gradle"
export PATH="${GRADLE_HOME}/bin/:${PATH}"
echo "PATH:${PATH}"
gradle --version
mkdir /var/lib/jenkins/gradledeps
cp build.gradle /var/lib/jenkins/gradledeps
cp AndroidManifest.xml /var/lib/jenkins/gradledeps
pushd /var/lib/jenkins
export GRADLE_LOCAL_PROPERTIES=gradledeps/local.properties
rm -f $GRADLE_LOCAL_PROPERTIES
echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES
echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES
chown -R jenkins /var/lib/jenkins/gradledeps
chgrp -R jenkins /var/lib/jenkins/gradledeps
sudo -H -u jenkins $GRADLE_HOME/bin/gradle -Pandroid.useAndroidX=true -p /var/lib/jenkins/gradledeps -g /var/lib/jenkins/.gradle --refresh-dependencies --debug --stacktrace assemble
chown -R jenkins /var/lib/jenkins/.gradle
chgrp -R jenkins /var/lib/jenkins/.gradle
popd
rm -rf /var/lib/jenkins/.gradle/daemon
# Cache vision models used by the test
source "$(dirname "${BASH_SOURCE[0]}")/cache_vision_models.sh"

View File

@ -4,12 +4,12 @@ set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.bz2'
TARBALL='aotriton.tar.gz'
# This read command always returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects

View File

@ -76,7 +76,8 @@ install_ubuntu() {
vim \
unzip \
gpg-agent \
gdb
gdb \
bc
# Should resolve issues related to various apt package repository cert issues
# see: https://github.com/pytorch/pytorch/issues/65931

View File

@ -9,7 +9,7 @@ install_ubuntu() {
# Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh`
apt-get install -y cargo
echo "Checking out sccache repo"
git clone https://github.com/pytorch/sccache
git clone https://github.com/mozilla/sccache -b v0.9.0
cd sccache
echo "Building sccache"
cargo build --release
@ -19,6 +19,10 @@ install_ubuntu() {
rm -rf sccache
apt-get remove -y cargo rustc
apt-get autoclean && apt-get clean
echo "Downloading old sccache binary from S3 repo for PCH builds"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache-0.2.14a
chmod 755 /opt/cache/bin/sccache-0.2.14a
}
install_binary() {
@ -32,22 +36,42 @@ sed -e 's|PATH="\(.*\)"|PATH="/opt/cache/bin:\1"|g' -i /etc/environment
export PATH="/opt/cache/bin:$PATH"
# Setup compiler cache
if [ -n "$ROCM_VERSION" ]; then
curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
else
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
# TODO: Install the pre-built binary from S3 as building from source
# https://github.com/pytorch/sccache has started failing mysteriously
# in which sccache server couldn't start with the following error:
# sccache: error: Invalid argument (os error 22)
install_binary
fi
install_ubuntu
chmod a+x /opt/cache/bin/sccache
function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/opt/cache/bin/$1"
if [ $1 == "gcc" ]; then
# Do not call sccache recursively when dumping preprocessor argument
# For some reason it's very important for the first cached nvcc invocation
cat >"/opt/cache/bin/$1" <<EOF
#!/bin/sh
# sccache does not support -E flag, so we need to call the original compiler directly in order to avoid calling this wrapper recursively
for arg in "\$@"; do
if [ "\$arg" = "-E" ]; then
exec $(which $1) "\$@"
fi
done
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which $1) "\$@"
else
exec $(which $1) "\$@"
fi
EOF
else
cat >"/opt/cache/bin/$1" <<EOF
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which $1) "\$@"
else
exec $(which $1) "\$@"
fi
EOF
fi
chmod a+x "/opt/cache/bin/$1"
}
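For orientation, a hedged sketch of how the stub generator above is typically invoked later in this script (the compiler names are illustrative); each call writes /opt/cache/bin/<name>, which forwards to sccache unless the parent process already is sccache or, for gcc, when -E is passed:

write_sccache_stub cc
write_sccache_stub c++
write_sccache_stub gcc
write_sccache_stub g++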
@ -88,7 +112,7 @@ if [ -n "$ROCM_VERSION" ]; then
TOPDIR=$(dirname $OLDCOMP)
WRAPPED="$TOPDIR/original/$COMPNAME"
mv "$OLDCOMP" "$WRAPPED"
printf "#!/bin/sh\nexec sccache $WRAPPED \"\$@\"" > "$OLDCOMP"
printf "#!/bin/sh\nexec sccache $WRAPPED \"\$@\"" >"$OLDCOMP"
chmod a+x "$OLDCOMP"
}

View File

@ -13,11 +13,18 @@ if [ -n "$CLANG_VERSION" ]; then
elif [[ $UBUNTU_VERSION == 22.04 ]]; then
# work around ubuntu apt-get conflicts
sudo apt-get -y -f install
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
if [[ $CLANG_VERSION == 18 ]]; then
apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"
fi
fi
sudo apt-get update
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION"
apt-get install -y --no-install-recommends llvm-"$CLANG_VERSION"
if [[ $CLANG_VERSION -ge 18 ]]; then
apt-get install -y libomp-${CLANG_VERSION}-dev libclang-rt-${CLANG_VERSION}-dev clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
else
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
fi
# Install dev version of LLVM.
if [ -n "$LLVMDEV" ]; then

View File

@ -25,7 +25,8 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
SCRIPT_FOLDER="$( cd "$(dirname "$0")" ; pwd -P )"
source "${SCRIPT_FOLDER}/common_utils.sh"
pushd /tmp
wget -q "${BASE_URL}/${CONDA_FILE}"
@ -65,23 +66,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
NUMPY_VERSION=1.24.4
else
NUMPY_VERSION=1.26.2
fi
conda_install "openblas==0.3.28=*openmp*"
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
NUMPY_VERSION=1.26.0
else
NUMPY_VERSION=1.21.2
fi
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
fi
conda_install ${CONDA_COMMON_DEPS}
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
@ -97,14 +85,13 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Magma package names are concatenation of CUDA major and minor ignoring revision
# I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89
# Magma is installed from a tarball in the ossci-linux bucket into the conda env
if [ -n "$CUDA_VERSION" ]; then
conda_install magma-cuda$(TMP=${CUDA_VERSION/./};echo ${TMP%.*[0-9]}) -c pytorch
${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION}) ${ANACONDA_PYTHON_VERSION}
fi
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt
pip_install numpy=="$NUMPY_VERSION"
pip_install -U scikit-learn
if [ -n "$DOCS" ]; then
apt-get update

View File

@ -7,7 +7,7 @@ PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/hea
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
function check_var {
if [ -z "$1" ]; then
@ -22,6 +22,13 @@ function do_cpython_build {
check_var $py_ver
check_var $py_folder
tar -xzf Python-$py_ver.tgz
local additional_flags=""
if [ "$py_ver" == "3.13.0t" ]; then
additional_flags=" --disable-gil"
mv cpython-3.13/ cpython-3.13t/
fi
pushd $py_folder
local prefix="/opt/_internal/cpython-${py_ver}"
@ -37,8 +44,10 @@ function do_cpython_build {
local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"
fi
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} ${additional_flags} > /dev/null
make -j40 > /dev/null
make install > /dev/null
@ -61,7 +70,7 @@ function do_cpython_build {
# install setuptools since python 3.12 is required to use distutils
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
ln -sf ${prefix} /opt/python/${abi_tag}
}
function build_cpython {
@ -69,7 +78,14 @@ function build_cpython {
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
if [ "$py_ver" = "3.13.0" ]; then
if [ "$py_ver" = "3.13.0t" ]; then
PY_VER_SHORT="3.13"
PYT_VER_SHORT="3.13t"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PYT_VER_SHORT
elif [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
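A quick sanity-check sketch for the free-threaded 3.13.0t build above; the prefix follows the script's /opt/_internal/cpython-${py_ver} convention, and sys._is_gil_enabled() exists as of CPython 3.13:

/opt/_internal/cpython-3.13.0t/bin/python3 -c "import sys; print(sys._is_gil_enabled())"
# expected to print False for a --disable-gil build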

View File

@ -3,7 +3,7 @@
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.1.0.70
CUDNN_VERSION=9.5.1.17
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
@ -38,7 +38,19 @@ function install_cusparselt_062 {
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
@ -105,7 +117,8 @@ function install_121 {
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
@ -137,6 +150,39 @@ function install_124 {
ldconfig
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
chmod +x cuda_12.6.3_560.35.05_linux.run
./cuda_12.6.3_560.35.05_linux.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
@ -227,12 +273,46 @@ function prune_124 {
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
@ -243,6 +323,8 @@ do
;;
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
*) echo "bad argument $1"; exit 1
;;
esac

View File

@ -4,20 +4,33 @@
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_052 {
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
@ -28,10 +41,10 @@ function install_124 {
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz -O cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/lib/* /usr/local/cuda/lib64/
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
@ -44,7 +57,7 @@ function install_124 {
cd ..
rm -rf nccl
install_cusparselt_052
install_cusparselt_062
ldconfig
}
@ -74,18 +87,87 @@ function prune_124 {
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
chmod +x cuda_12.6.3_560.35.05_linux_sbsa.run
./cuda_12.6.3_560.35.05_linux_sbsa.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
*) echo "bad argument $1"; exit 1
;;
esac

View File

@ -4,7 +4,9 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:2} == "12" ]]; then
if [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"

View File

@ -5,7 +5,7 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-4]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

View File

@ -36,25 +36,19 @@ install_conda_dependencies() {
}
install_pip_dependencies() {
pushd executorch/.ci/docker
# Install PyTorch CPU build beforehand to avoid installing the much bigger CUDA
# binaries later, ExecuTorch only needs CPU
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# Install all Python dependencies
pip_install -r requirements-ci.txt
pushd executorch
as_jenkins bash install_requirements.sh --pybind xnnpack
popd
}
setup_executorch() {
pushd executorch
# Setup swiftshader and Vulkan SDK which are required to build the Vulkan delegate
as_jenkins bash .ci/scripts/setup-vulkan-linux-deps.sh
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
as_jenkins .ci/scripts/setup-linux.sh cmake
as_jenkins .ci/scripts/setup-linux.sh cmake || true
popd
}

View File

@ -7,14 +7,20 @@ source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
function install_huggingface() {
local commit
commit=$(get_pinned_commit huggingface)
pip_install pandas==2.0.3
pip_install "git+https://github.com/huggingface/transformers@${commit}"
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas==2.0.3
# TODO (huydhn): There is no torchvision release on 3.13 when I write this, so
# I'm using nightly here instead. We just need the package to be able to install
# TIMM. Removing this once vision has a release on 3.13
if [[ "${ANACONDA_PYTHON_VERSION}" == "3.13" ]]; then
pip_install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
fi
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y cmake torch torchvision triton

View File

@ -3,8 +3,6 @@
set -eou pipefail
MAGMA_VERSION="2.5.2"
function do_install() {
cuda_version=$1
cuda_version_nodot=${1/./}
@ -17,7 +15,7 @@ function do_install() {
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mkdir -p "${cuda_dir}/magma"
mv include "${cuda_dir}/magma/include"

View File

@ -0,0 +1,26 @@
#!/usr/bin/env bash
# Script that replaces the magma install from a conda package
set -eou pipefail
function do_install() {
cuda_version_nodot=${1/./}
anaconda_python_version=$2
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
anaconda_dir="/opt/conda/envs/py_${anaconda_python_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
}
do_install $1 $2
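A minimal invocation sketch for the helper above, assuming it is saved as install_magma_conda.sh as referenced from install_conda.sh (the arguments are illustrative):

bash install_magma_conda.sh 12.4 3.10
# fetches magma-cuda124-2.6.1-1.tar.bz2 from the ossci-linux bucket and unpacks
# its include/ and lib/ into /opt/conda/envs/py_3.10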

View File

@ -10,6 +10,21 @@ if [[ -z $ROCM_VERSION ]]; then
exit 1;
fi
IS_UBUNTU=0
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
IS_UBUNTU=1
;;
centos|almalinux)
IS_UBUNTU=0
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# To make version comparison easier, create an integer representation.
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})
@ -28,12 +43,6 @@ else
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
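For clarity, a worked example of the integer encoding (the ROCm version used here is illustrative):

# ROCM_VERSION=6.2.1 -> 6*10000 + 2*100 + 1 = 60201, which falls inside the
# 60200..60203 window below that selects the release/rocm-rel-6.2-staging branch.
echo $(( 6 * 10000 + 2 * 100 + 1 ))   # prints 60201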
# Install custom MIOpen + COMgr for ROCm >= 4.0.1
if [[ $ROCM_INT -lt 40001 ]]; then
echo "ROCm version < 4.0.1; will not install custom MIOpen"
exit 0
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
@ -51,70 +60,49 @@ else
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}"
fi
# MIOPEN_USE_HIP_KERNELS is a Workaround for COMgr issues
MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_USE_COMGR=ON
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version
if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
echo "ROCm 6.2 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then
echo "ROCm 6.0 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50700 ]] && [[ $ROCM_INT -lt 60000 ]]; then
echo "ROCm 5.7 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50600 ]] && [[ $ROCM_INT -lt 50700 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.6-staging"
elif [[ $ROCM_INT -ge 50500 ]] && [[ $ROCM_INT -lt 50600 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.5-gfx11"
elif [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.4-staging"
elif [[ $ROCM_INT -ge 50300 ]] && [[ $ROCM_INT -lt 50400 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.3-staging"
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50300 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.2-staging"
elif [[ $ROCM_INT -ge 50100 ]] && [[ $ROCM_INT -lt 50200 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.1-staging"
elif [[ $ROCM_INT -ge 50000 ]] && [[ $ROCM_INT -lt 50100 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.0-staging"
if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60204 ]]; then
MIOPEN_BRANCH="release/rocm-rel-6.2-staging"
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
echo "ROCm ${ROCM_VERSION} does not need any patches, do not build from source"
exit 0
fi
yum remove -y miopen-hip
if [[ ${IS_UBUNTU} == 1 ]]; then
apt-get remove -y miopen-hip
else
# Workaround since almalinux manylinux image already has this and cget doesn't like that
rm -rf /usr/local/lib/pkgconfig/sqlite3.pc
# Versioned package name needs regex match
# Use --noautoremove to prevent other rocm packages from being uninstalled
yum remove -y miopen-hip* --noautoremove
fi
git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}
pushd MIOpen
# remove .git to save disk space since CI runner was running out
rm -rf .git
# Don't build MLIR to save docker build time
# since we are disabling MLIR backend for MIOpen anyway
if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
sed -i '/rocMLIR/d' requirements.txt
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50400 ]]; then
sed -i '/llvm-project-mlir/d' requirements.txt
fi
# Don't build CK to save docker build time
sed -i '/composable_kernel/d' requirements.txt
## MIOpen minimum requirements
cmake -P install_deps.cmake --minimum
# clean up since CI runner was running out of disk space
rm -rf /tmp/*
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
if [[ ${IS_UBUNTU} == 1 ]]; then
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
else
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
fi
## Build MIOpen
mkdir -p build
@ -122,7 +110,7 @@ cd build
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig CXX=${ROCM_INSTALL_PATH}/llvm/bin/clang++ cmake .. \
${MIOPEN_CMAKE_COMMON_FLAGS} \
${MIOPEN_CMAKE_DB_FLAGS} \
-DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}/hip;${ROCM_INSTALL_PATH}"
-DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}"
make MIOpen -j $(nproc)
# Build MIOpen package
@ -131,7 +119,11 @@ make -j $(nproc) package
# clean up since CI runner was running out of disk space
rm -rf /usr/local/cget
yum install -y miopen-*.rpm
if [[ ${IS_UBUNTU} == 1 ]]; then
sudo dpkg -i miopen-hip*.deb
else
yum install -y miopen-*.rpm
fi
popd
rm -rf MIOpen

View File

@ -32,7 +32,7 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
pip_install onnxscript==0.1.0.dev20241124 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -4,7 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.25 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.28 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="

View File

@ -12,7 +12,7 @@ case "$ID" in
apt-get install -y libpciaccess-dev pkg-config
apt-get clean
;;
centos)
centos|almalinux)
yum install -y libpciaccess-devel pkgconfig
;;
*)

View File

@ -3,6 +3,18 @@
set -ex
# Magma build scripts need `python`
ln -sf /usr/bin/python3 /usr/bin/python
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
almalinux)
yum install -y gcc-gfortran
;;
*)
echo "No preinstalls to build magma..."
;;
esac
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}

View File

@ -12,14 +12,14 @@ conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
}
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton-rocm"
elif [ -n "${XPU_VERSION}" ]; then
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
elif [ -n "${TRITON_CPU}" ]; then
TRITON_REPO="https://github.com/triton-lang/triton-cpu"
TRITON_TEXT_FILE="triton-cpu"
else
TRITON_REPO="https://github.com/openai/triton"
TRITON_REPO="https://github.com/triton-lang/triton"
TRITON_TEXT_FILE="triton"
fi
@ -47,9 +47,10 @@ chown -R jenkins /var/lib/jenkins/triton
chgrp -R jenkins /var/lib/jenkins/triton
pushd /var/lib/jenkins/
as_jenkins git clone ${TRITON_REPO} triton
as_jenkins git clone --recursive ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

View File

@ -2,6 +2,13 @@
set -ex
# Since version 24 the system ships with user 'ubuntu' that has id 1000
# We need a work-around to enable id 1000 usage for this script
if [[ $UBUNTU_VERSION == 24.04 ]]; then
# touch is used to disable harmless error message
touch /var/mail/ubuntu && chown ubuntu /var/mail/ubuntu && userdel -r ubuntu
fi
# Mirror jenkins user in container
# jenkins user as ec2-user should have the same user-id
echo "jenkins:x:1000:1000::/var/lib/jenkins:" >> /etc/passwd

View File

@ -24,10 +24,10 @@ function install_ubuntu() {
| tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
# To add the online network network package repository for the Intel Support Packages
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \
https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \
| tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list
| gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg.gpg] \
https://apt.repos.intel.com/${XPU_REPO_NAME} all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
apt-get update
@ -41,14 +41,13 @@ function install_ubuntu() {
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
apt-get install -y intel-ocloc
fi
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev
else
apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev
fi
apt-get install -y ${XPU_PACKAGES}
# Cleanup
apt-get autoclean && apt-get clean
@ -58,13 +57,13 @@ function install_ubuntu() {
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
if [[ ! " 8.6 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
if [[ ! " 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
elif [[ "${ID}" == "almalinux" ]]; then
# Workaround for almalinux8 which used by quay.io/pypa/manylinux_2_28_x86_64
VERSION_ID="8.6"
VERSION_ID="8.8"
fi
dnf install -y 'dnf-command(config-manager)'
@ -72,16 +71,18 @@ function install_rhel() {
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/${VERSION_ID}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_ID}.repo
# To add the online network package repository for the Intel Support Packages
tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF
[intel-for-pytorch-gpu-dev]
tee > /etc/yum.repos.d/oneAPI.repo << EOF
[oneAPI]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/intel-for-pytorch-gpu-dev
baseurl=https://yum.repos.intel.com/${XPU_REPO_NAME}
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Install Intel Support Packages
yum install -y ${XPU_PACKAGES}
# The xpu-smi packages
dnf install -y xpu-smi
# Compute and Media Runtimes
@ -96,8 +97,6 @@ EOF
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel Support Packages
yum install -y intel-for-pytorch-gpu-dev intel-pti-dev
# Cleanup
dnf clean all
@ -119,7 +118,7 @@ function install_sles() {
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev
zypper addrepo https://yum.repos.intel.com/${XPU_REPO_NAME} oneAPI
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
@ -131,7 +130,7 @@ function install_sles() {
zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel
# Install Intel Support Packages
zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev
zypper install -y ${XPU_PACKAGES}
}
@ -142,6 +141,13 @@ if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
XPU_DRIVER_VERSION=""
fi
XPU_REPO_NAME="intel-for-pytorch-gpu-dev"
XPU_PACKAGES="intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9"
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_REPO_NAME="oneapi"
XPU_PACKAGES="intel-deep-learning-essentials-2025.0"
fi
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in

View File

@ -66,6 +66,11 @@ RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
RUN bash ./install_magma.sh 12.6
RUN ln -sf /usr/local/cuda-12.6 /usr/local/cuda
FROM cpu as rocm
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

View File

@ -39,17 +39,7 @@ case ${GPU_ARCH_TYPE} in
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx942"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)

View File

@ -25,7 +25,8 @@ ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install cuda and cudnn
ARG CUDA_VERSION

View File

@ -10,6 +10,7 @@ ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This patch is required since CentOS has reached EOL;
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
@ -143,6 +144,10 @@ COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
ENV PATH /opt/conda/bin:$PATH
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo


@ -1,5 +1,4 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=amd64/almalinux:8
FROM quay.io/pypa/manylinux_2_28_x86_64 as base
@ -117,30 +116,49 @@ COPY --from=jni /usr/local/include/jni.h /usr/local/
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
ENV PATH /opt/conda/bin:$PATH
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# Install setuptools and wheel for python 3.12/3.13
RUN for cpython_version in "cp312-cp312" "cp313-cp313" "cp313-cp313t"; do \
/opt/python/${cpython_version}/bin/python -m pip install setuptools wheel; \
done;
# cmake-3.18.4 from pip
# cmake-3.18.4 from pip; force in case cmake3 already exists
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
ln -sf /usr/local/bin/cmake /usr/bin/cmake3
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM cpu_final as rocm_final
ARG ROCM_VERSION=6.0
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ARG DEVTOOLSET_VERSION=11
ENV LDFLAGS="-Wl,-rpath=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64 -Wl,-rpath=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib"
# Somewhere in the ROCm stack we still use the non-existent /opt/rocm/hip path;
# the workaround below helps avoid the error
ENV ROCM_PATH /opt/rocm
# cmake-3.28.4 from pip to get enable_language(HIP)
# and avoid 3.21.0 cmake+ninja issues with ninja inserting "-Wl,--no-as-needed" in LINK_FLAGS for static linker
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
ENV MKLROOT /opt/intel
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
@ -150,8 +168,7 @@ ENV XPU_DRIVER_TYPE ROLLING
# cmake-3.28.4 from pip
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# Install setuptools and wheel for python 3.13
RUN /opt/python/cp313-cp313/bin/python -m pip install setuptools wheel
ADD ./common/install_xpu.sh install_xpu.sh
ENV XPU_VERSION 2025.0
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd


@ -48,6 +48,11 @@ ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/op
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as final
# remove unnecessary python versions
@ -55,3 +60,5 @@ RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH


@ -61,7 +61,7 @@ RUN git config --global --add safe.directory "*"
# NOTE: Need a better way to get this library as Ubuntu's package can be removed by the vendor, or changed
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/


@ -1,17 +1,20 @@
FROM --platform=linux/s390x docker.io/ubuntu:24.04 as base
FROM quay.io/pypa/manylinux_2_28_s390x as base
# Language variables
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
ARG DEVTOOLSET_VERSION=13
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN apt update ; apt upgrade -y
RUN apt install -y \
build-essential \
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
sudo \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
@ -24,19 +27,40 @@ RUN apt install -y \
util-linux \
wget \
which \
xz-utils \
xz \
yasm \
less \
zstd \
libgomp \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-binutils \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
cmake \
python3 \
python3-dev \
python3-setuptools \
python3-yaml \
python3-typing-extensions \
libblas-dev \
libopenblas-dev \
liblapack-dev \
libatlas-base-dev
rust \
cargo \
llvm-devel \
libzstd-devel \
python3.12-devel \
python3.12-setuptools \
python3.12-pip \
python3-virtualenv \
python3.12-pyyaml \
python3.12-numpy \
python3.12-wheel \
python3.12-cryptography \
blas-devel \
openblas-devel \
lapack-devel \
atlas-devel \
libjpeg-devel \
libxslt-devel \
libxml2-devel \
openssl-devel \
valgrind
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
@ -44,14 +68,8 @@ RUN apt install -y \
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# installed python doesn't have development parts. Rebuild it from scratch
RUN /bin/rm -rf /opt/_internal /opt/python /usr/local/*/*
# EPEL for cmake
FROM base as patchelf
@ -64,10 +82,43 @@ FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
ENV SSL_CERT_FILE=
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM openssl as final
FROM base as final
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
RUN alternatives --set python /usr/bin/python3.12
RUN alternatives --set python3 /usr/bin/python3.12
RUN pip-3.12 install typing_extensions
ENTRYPOINT []
CMD ["/bin/bash"]
# install test dependencies:
# - grpcio requires system openssl, bundled crypto fails to build
# - ml_dtypes 0.4.0 requires some fixes provided in later commits to build
RUN dnf install -y \
protobuf-devel \
protobuf-c-devel \
protobuf-lite-devel \
wget \
patch
RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio==1.65.4
RUN cd ~ && \
git clone https://github.com/jax-ml/ml_dtypes && \
cd ml_dtypes && \
git checkout v0.4.0 && \
git submodule update --init --recursive && \
wget https://github.com/jax-ml/ml_dtypes/commit/b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
wget https://github.com/jax-ml/ml_dtypes/commit/d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
patch -p1 < b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
patch -p1 < d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
python3 setup.py bdist_wheel && \
pip3 install dist/*.whl && \
rm -rf ml_dtypes


@ -61,7 +61,7 @@ case ${GPU_ARCH_TYPE} in
cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=redhat/ubi9
GPU_IMAGE=s390x/almalinux:8
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
@ -87,22 +87,18 @@ case ${GPU_ARCH_TYPE} in
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm)
rocm|rocm-manylinux_2_28)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
DEVTOOLSET_VERSION="9"
if [ ${GPU_ARCH_TYPE} == "rocm-manylinux_2_28" ]; then
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=9"
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)
TARGET=xpu_final
@ -124,7 +120,16 @@ if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
fi
(
set -x
DOCKER_BUILDKIT=1 docker build \
if [ "$(uname -m)" != "s390x" ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \


@ -16,37 +16,27 @@ CURL_HASH=cf34fe0b07b800f1c01a499a6e8b2af548f6d0e044dca4a29d88a4bee146d131
AUTOCONF_ROOT=autoconf-2.69
AUTOCONF_HASH=954bd69b391edc12d6a4a51a2dd1476543da5c6bbf05a95b59dc0dd6fd4c2969
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel"
if [ "$(uname -m)" != "s390x" ] ; then
PYTHON_COMPILE_DEPS="${PYTHON_COMPILE_DEPS} db4-devel"
else
PYTHON_COMPILE_DEPS="${PYTHON_COMPILE_DEPS} libdb-devel"
fi
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
# Get build utilities
MY_DIR=$(dirname "${BASH_SOURCE[0]}")
source $MY_DIR/build_utils.sh
if [ "$(uname -m)" != "s390x" ] ; then
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
# Development tools and libraries
yum -y install bzip2 make git patch unzip bison yasm diffutils \
automake which file cmake28 \
kernel-devel-`uname -r` \
${PYTHON_COMPILE_DEPS}
else
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib1g-dev libbz2-dev libncurses-dev libsqlite3-dev libdb-dev libpcap-dev liblzma-dev libffi-dev"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="libglib2.0-dev libX11-dev libncurses-dev"
# Development tools and libraries
apt install -y bzip2 make git patch unzip diffutils \
automake which file cmake \
linux-headers-virtual \
${PYTHON_COMPILE_DEPS}
fi
# Development tools and libraries
yum -y install bzip2 make git patch unzip bison yasm diffutils \
automake which file \
${PYTHON_COMPILE_DEPS}
# Install newest autoconf
build_autoconf $AUTOCONF_ROOT $AUTOCONF_HASH
@ -92,16 +82,13 @@ ln -s $PY39_BIN/auditwheel /usr/local/bin/auditwheel
# Clean up development headers and other unnecessary stuff for
# final image
if [ "$(uname -m)" != "s390x" ] ; then
yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \
avahi freetype bitstream-vera-fonts \
${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
yum -y install ${MANYLINUX1_DEPS}
yum -y clean all > /dev/null 2>&1
yum list installed
else
apt purge -y ${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
fi
yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \
avahi freetype bitstream-vera-fonts \
${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
yum -y install ${MANYLINUX1_DEPS}
yum -y clean all > /dev/null 2>&1
yum list installed
# we don't need libpython*.a, and they're many megabytes
find /opt/_internal -name '*.a' -print0 | xargs -0 rm -f
# Strip what we can -- and ignore errors, because this just attempts to strip


@ -1,10 +1,12 @@
# cf. https://github.com/pypa/manylinux/issues/53
import sys
from urllib.request import urlopen
GOOD_SSL = "https://google.com"
BAD_SSL = "https://self-signed.badssl.com"
import sys
print("Testing SSL certificate checking for Python:", sys.version)
@ -12,14 +14,8 @@ if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
if sys.version_info[0] >= 3:
from urllib.request import urlopen
EXC = OSError
else:
from urllib import urlopen
EXC = IOError
EXC = OSError
print(f"Connecting to {GOOD_SSL} should work")
urlopen(GOOD_SSL)


@ -5,7 +5,7 @@
#Pinned versions: 1.6
#test that import:
boto3==1.19.12
boto3==1.35.42
#Description: AWS SDK for python
#Pinned versions: 1.19.12, 1.16.34
#test that import:
@ -30,9 +30,14 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
expecttest==0.3.0
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.3.0
#test that import:
fbscribelogger==0.1.7
#Description: write to scribe from authenticated jobs on CI
#Pinned versions: 0.1.6
#test that import:
@ -85,7 +90,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.10.0
mypy==1.13.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
@ -113,7 +118,7 @@ numba==0.55.2 ; python_version == "3.10"
#numpy
#Description: Provides N-dimensional arrays and linear algebra
#Pinned versions: 1.20
#Pinned versions: 1.26.2
#test that import: test_view_ops.py, test_unary_ufuncs.py, test_type_promotion.py,
#test_type_info.py, test_torch.py, test_tensorexpr_pybind.py, test_tensorexpr.py,
#test_tensorboard.py, test_tensor_creation_ops.py, test_static_runtime.py,
@ -123,6 +128,12 @@ numba==0.55.2 ; python_version == "3.10"
#test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py,
#test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py,
#test_binary_ufuncs.py
numpy==1.22.4; python_version == "3.9" or python_version == "3.10"
numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
numpy==2.1.2; python_version >= "3.13"
pandas==2.0.3; python_version < "3.13"
pandas==2.2.3; python_version >= "3.13"
#onnxruntime
#Description: scoring engine for Open Neural Network Exchange (ONNX) models
@ -134,9 +145,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.12.1
optree==0.13.0
#Description: A library for tree manipulation
#Pinned versions: 0.12.1
#Pinned versions: 0.13.0
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
@ -147,7 +158,7 @@ optree==0.12.1
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.3.0
pillow==11.0.0
#Description: Python Imaging Library fork
#Pinned versions: 10.3.0
#test that import:
@ -182,6 +193,11 @@ pytest-rerunfailures>=10.3
#Pinned versions:
#test that import:
pytest-subtests==0.13.1
#Description: plugin for subtest support
#Pinned versions:
#test that import:
#pytest-benchmark
#Description: fixture for benchmarking code
#Pinned versions: 3.2.3
@ -229,7 +245,7 @@ scikit-image==0.22.0 ; python_version >= "3.10"
#test that import:
scipy==1.10.1 ; python_version <= "3.11"
scipy==1.12.0 ; python_version == "3.12"
scipy==1.14.1 ; python_version >= "3.12"
# Pin SciPy because of failing distribution tests (see #60347)
#Description: scientific python
#Pinned versions: 1.10.1
@ -248,7 +264,7 @@ tb-nightly==2.13.0a20230426
#test that import:
# needed by torchgen utils
typing-extensions
typing-extensions>=4.10.0
#Description: type hints for python
#Pinned versions:
#test that import:
@ -264,26 +280,21 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#test that import:
#lintrunner is supported on aarch64-linux only from 0.12.4 version
lintrunner==0.12.5
lintrunner==0.12.7
#Description: all about linters!
#Pinned versions: 0.12.5
#Pinned versions: 0.12.7
#test that import:
redis>=4.0.0
#Description: redis database
#test that import: anything that tests OSS caching/mocking (inductor/test_codecache.py, inductor/test_max_autotune.py)
rockset==1.0.3
#Description: queries Rockset
#Pinned versions: 1.0.3
#test that import:
ghstack==0.8.0
#Description: ghstack tool
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.4
jinja2==3.1.5
#Description: jinja2 template engine
#Pinned versions: 3.1.4
#test that import:
@ -298,32 +309,32 @@ z3-solver==4.12.2.0
#Pinned versions:
#test that import:
tensorboard==2.13.0
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
#Description: Also included in .ci/docker/requirements-docs.txt
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1 ; python_version < "3.12"
pywavelets==1.5.0 ; python_version >= "3.12"
pywavelets==1.7.0 ; python_version >= "3.12"
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0
lxml==5.3.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries
PyGithub==2.3.0
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
onnx==1.16.1
onnx==1.17.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -332,3 +343,31 @@ onnxscript==0.1.0.dev20240817
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 1.24.0
#test that import: test_sac_estimator.py
pwlf==2.2.1 ; python_version >= "3.8"
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
astunparse
PyYAML
setuptools
ninja==1.11.1 ; platform_machine == "aarch64"
scons==4.5.2 ; platform_machine == "aarch64"
pulp==2.9.0 ; python_version >= "3.8"
#Description: required for testing ILP formulation under torch/distributed/_tools
#Pinned versions: 2.9.0
#test that import: test_sac_ilp.py


@ -14,7 +14,8 @@ matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
tensorboard==2.13.0
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
#Description: This is used to generate PyTorch docs
#Pinned versions: 2.13.0


@ -1 +1 @@
3.0.0
3.2.0


@ -30,7 +30,8 @@ ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc
ARG GCC_VERSION
@ -80,6 +81,8 @@ RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt


@ -68,6 +68,8 @@ RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
ENV ROCM_PATH /opt/rocm
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
@ -100,10 +102,10 @@ ARG TRITON
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt
@ -112,6 +114,12 @@ COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# This is needed by sccache
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
@ -121,5 +129,8 @@ RUN bash ./install_cache.sh && rm install_cache.sh
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]


@ -36,7 +36,8 @@ ENV DOCS=$DOCS
COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
RUN if [ -n "${UNINSTALL_DILL}" ]; then pip uninstall -y dill; fi
# Install gcc
@ -87,19 +88,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install Android NDK
ARG ANDROID
ARG ANDROID_NDK
ARG GRADLE_VERSION
COPY ./common/install_android.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
COPY ./android/AndroidManifest.xml AndroidManifest.xml
COPY ./android/build.gradle build.gradle
RUN if [ -n "${ANDROID}" ]; then bash ./install_android.sh; fi
RUN rm install_android.sh cache_vision_models.sh common_utils.sh
RUN rm AndroidManifest.xml
RUN rm build.gradle
ENV INSTALLED_ANDROID ${ANDROID}
# (optional) Install Vulkan SDK
ARG VULKAN_SDK_VERSION
COPY ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh
@ -147,6 +135,13 @@ COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG TRITON_CPU
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt
RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-cpu.txt
ARG EXECUTORCH
# Build and install executorch
COPY ./common/install_executorch.sh install_executorch.sh

10
.ci/libtorch/build.sh Normal file

@ -0,0 +1,10 @@
#!/usr/bin/env bash
# This is mostly just a shim to manywheel/build.sh
# TODO: Make this a dedicated script to build just libtorch
set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh

2
.ci/magma/.gitignore vendored Normal file

@ -0,0 +1,2 @@
output/
magma-cuda*/

48
.ci/magma/Makefile Normal file

@ -0,0 +1,48 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_CUDA ?= 11.8
DESIRED_CUDA_SHORT = $(subst .,,$(DESIRED_CUDA))
PACKAGE_NAME = magma-cuda
CUDA_ARCH_LIST ?= -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
-w /builder \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \
-e DESIRED_CUDA=${DESIRED_CUDA} \
-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \
"pytorch/manylinux-builder:cuda${DESIRED_CUDA}-main" \
magma/build_magma.sh
.PHONY: all
all: magma-cuda126
all: magma-cuda124
all: magma-cuda121
all: magma-cuda118
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda126
magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
$(DOCKER_RUN)
.PHONY: magma-cuda124
magma-cuda124: DESIRED_CUDA := 12.4
magma-cuda124:
$(DOCKER_RUN)
.PHONY: magma-cuda121
magma-cuda121: DESIRED_CUDA := 12.1
magma-cuda121:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37
magma-cuda118:
$(DOCKER_RUN)

50
.ci/magma/README.md Normal file

@ -0,0 +1,50 @@
# Magma
This folder contains the scripts and configurations to build magma, statically linked for various versions of CUDA.
## Building
Look in the `Makefile` for available targets to build. To build any target, for example `magma-cuda118`, run
```
# Using `docker`
make magma-cuda118
# Using `podman`
DOCKER_CMD=podman make magma-cuda118
```
This spawns a container from the `pytorch/manylinux-cuda<version>` docker image, which has the required `devtoolset` and CUDA versions installed.
Inside the container, it runs `build_magma.sh` with the correct environment variables set, which packages the necessary files
into a tarball with the following structure:
```
.
├── include # header files
├── lib # libmagma.a
├── info
│ ├── licenses # license file
│ └── recipe # build script and patches
```
More specifically, `build_magma.sh` copies over the relevant files from the `package_files` directory depending on the CUDA version.
The built binaries are placed in the `output` folder.
## Pushing
Packages can be uploaded to an S3 bucket using:
```
aws s3 cp output/*/magma-cuda*.bz2 <bucket-with-path>
```
If you do not have upload permissions, please ping @seemethere or @soumith to gain access
## New versions
New CUDA versions can be added by creating a new make target with the next desired version. For CUDA version NN.n, the target should be named `magma-cudaNNn`.
Make sure to edit the appropriate environment variables (e.g., DESIRED_CUDA, CUDA_ARCH_LIST) in the `Makefile` accordingly. Remember also to check `build_magma.sh` to ensure the logic for copying over the files remains correct.
New patches can be added by editing `Makefile` and `build_magma.sh` in the same way `getrf_nbparam.patch` is implemented.
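
For illustration, a hypothetical `magma-cuda128` target (CUDA 12.8 is assumed here purely as an example; use the version and `-gencode` flags you actually need) would follow the same pattern as the existing targets:

```
# Hypothetical example target; adjust DESIRED_CUDA and, if needed, CUDA_ARCH_LIST
.PHONY: magma-cuda128
magma-cuda128: DESIRED_CUDA := 12.8
magma-cuda128:
	$(DOCKER_RUN)
```

Remember to also list the new target under `all:` so it is built by default.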

50
.ci/magma/build_magma.sh Executable file

@ -0,0 +1,50 @@
#!/usr/bin/env bash
set -eou pipefail
# Environment variables
# The script expects DESIRED_CUDA and PACKAGE_NAME to be set
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
MAGMA_VERSION=2.6.1
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma/package_files # source patches and metadata
PACKAGE_DIR=${ROOT_DIR}/magma/${PACKAGE_NAME} # build workspace
PACKAGE_OUTPUT=${ROOT_DIR}/magma/output # where tarballs are stored
PACKAGE_BUILD=${PACKAGE_DIR}/build # where the content of the tarball is prepared
PACKAGE_RECIPE=${PACKAGE_BUILD}/info/recipe
PACKAGE_LICENSE=${PACKAGE_BUILD}/info/licenses
mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RECIPE} ${PACKAGE_LICENSE}
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
curl -LO http://icl.utk.edu/projectsfiles/magma/downloads/magma-${MAGMA_VERSION}.tar.gz
tar zxf magma-${MAGMA_VERSION}.tar.gz
sha256sum --check < ${PACKAGE_FILES}/magma-${MAGMA_VERSION}.sha256
popd
# Apply patches and build
pushd ${PACKAGE_DIR}/magma-${MAGMA_VERSION}
patch < ${PACKAGE_FILES}/CMake.patch
patch < ${PACKAGE_FILES}/cmakelists.patch
patch -p0 < ${PACKAGE_FILES}/thread_queue.patch
patch -p1 < ${PACKAGE_FILES}/getrf_shfl.patch
patch -p1 < ${PACKAGE_FILES}/getrf_nbparam.patch
# The build.sh script expects to be executed from the sources root folder
INSTALL_DIR=${PACKAGE_BUILD} ${PACKAGE_FILES}/build.sh
popd
# Package recipe, license and tarball
# Folder and package name are backward compatible for the build workflow
cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh
cp ${PACKAGE_FILES}/thread_queue.patch ${PACKAGE_RECIPE}/thread_queue.patch
cp ${PACKAGE_FILES}/cmakelists.patch ${PACKAGE_RECIPE}/cmakelists.patch
cp ${PACKAGE_FILES}/getrf_shfl.patch ${PACKAGE_RECIPE}/getrf_shfl.patch
cp ${PACKAGE_FILES}/getrf_nbparam.patch ${PACKAGE_RECIPE}/getrf_nbparam.patch
cp ${PACKAGE_FILES}/CMake.patch ${PACKAGE_RECIPE}/CMake.patch
cp ${PACKAGE_FILES}/magma-${MAGMA_VERSION}.sha256 ${PACKAGE_RECIPE}/magma-${MAGMA_VERSION}.sha256
cp ${PACKAGE_DIR}/magma-${MAGMA_VERSION}/COPYRIGHT ${PACKAGE_LICENSE}/COPYRIGHT
pushd ${PACKAGE_BUILD}
tar cjf ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2 include lib info
echo Built in ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2
popd


@ -0,0 +1,40 @@
--- CMake.src.cuda 2023-03-29 10:05:32.136954140 +0000
+++ CMake.src.cuda 2023-03-29 10:05:50.281318043 +0000
@@ -283,10 +283,10 @@
magmablas/zgeadd.cu
magmablas/zgeadd2.cu
magmablas/zgeam.cu
-magmablas/zgemm_fermi.cu
+#magmablas/zgemm_fermi.cu
magmablas/zgemm_reduce.cu
magmablas/zgemv_conj.cu
-magmablas/zgemv_fermi.cu
+#magmablas/zgemv_fermi.cu
magmablas/zgerbt.cu
magmablas/zgerbt_kernels.cu
magmablas/zgetmatrix_transpose.cpp
@@ -1009,18 +1009,18 @@
magmablas/sgeam.cu
magmablas/dgeam.cu
magmablas/cgeam.cu
-magmablas/sgemm_fermi.cu
-magmablas/dgemm_fermi.cu
-magmablas/cgemm_fermi.cu
+#magmablas/sgemm_fermi.cu
+#magmablas/dgemm_fermi.cu
+#magmablas/cgemm_fermi.cu
magmablas/sgemm_reduce.cu
magmablas/dgemm_reduce.cu
magmablas/cgemm_reduce.cu
magmablas/sgemv_conj.cu
magmablas/dgemv_conj.cu
magmablas/cgemv_conj.cu
-magmablas/sgemv_fermi.cu
-magmablas/dgemv_fermi.cu
-magmablas/cgemv_fermi.cu
+#magmablas/sgemv_fermi.cu
+#magmablas/dgemv_fermi.cu
+#magmablas/cgemv_fermi.cu
magmablas/sgerbt.cu
magmablas/dgerbt.cu
magmablas/cgerbt.cu


@ -0,0 +1,12 @@
CUDA__VERSION=$(nvcc --version|sed -n 4p|cut -f5 -d" "|cut -f1 -d",")
if [ "$CUDA__VERSION" != "$DESIRED_CUDA" ]; then
echo "CUDA Version is not $DESIRED_CUDA. CUDA Version found: $CUDA__VERSION"
exit 1
fi
mkdir build
cd build
cmake .. -DUSE_FORTRAN=OFF -DGPU_TARGET="All" -DCMAKE_INSTALL_PREFIX="$INSTALL_DIR" -DCUDA_ARCH_LIST="$CUDA_ARCH_LIST"
make -j$(getconf _NPROCESSORS_CONF)
make install
cd ..


@ -0,0 +1,388 @@
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d5d8d87d..8a507334 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -3,7 +3,7 @@ cmake_minimum_required( VERSION 2.8.1 )
# ----------------------------------------
# to disable Fortran, set this to "off"
# see also -DADD_ below
-option( USE_FORTRAN "Fortran is required for some tester checks, but can be disabled with reduced functionality" ON )
+option( USE_FORTRAN "Fortran is required for some tester checks, but can be disabled with reduced functionality" OFF )
if (USE_FORTRAN)
project( MAGMA C CXX Fortran )
@@ -75,6 +75,8 @@ else()
message( WARNING "The compiler ${CMAKE_CXX_COMPILER} doesn't support the -std=c++11 flag. Some code may not compile.")
endif()
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++ -fno-exceptions")
+
CHECK_C_COMPILER_FLAG("-std=c99" COMPILER_SUPPORTS_C99)
if (COMPILER_SUPPORTS_C99)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -std=c99")
@@ -101,15 +103,15 @@ endif()
# ----------------------------------------
-# locate OpenMP
-find_package( OpenMP )
-if (OPENMP_FOUND)
- message( STATUS "Found OpenMP" )
- message( STATUS " OpenMP_C_FLAGS ${OpenMP_C_FLAGS}" )
- message( STATUS " OpenMP_CXX_FLAGS ${OpenMP_CXX_FLAGS}" )
- set( CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}" )
- set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}" )
-endif()
+# # locate OpenMP
+# find_package( OpenMP )
+# if (OPENMP_FOUND)
+# message( STATUS "Found OpenMP" )
+# message( STATUS " OpenMP_C_FLAGS ${OpenMP_C_FLAGS}" )
+# message( STATUS " OpenMP_CXX_FLAGS ${OpenMP_CXX_FLAGS}" )
+# set( CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}" )
+# set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}" )
+# endif()
if (MAGMA_ENABLE_CUDA)
# ----------------------------------------
@@ -132,7 +134,7 @@ if (MAGMA_ENABLE_CUDA)
set( NV_SM "" )
set( NV_COMP "" )
- set(CUDA_SEPARABLE_COMPILATION ON)
+ set(CUDA_SEPARABLE_COMPILATION OFF)
# nvcc >= 6.5 supports -std=c++11, so propagate CXXFLAGS to NVCCFLAGS.
# Older nvcc didn't support -std=c++11, so previously we disabled propagation.
@@ -294,11 +296,18 @@ if (MAGMA_ENABLE_CUDA)
message( STATUS " compile for CUDA arch 8.0 (Ampere)" )
endif()
+ if ( ${GPU_TARGET} MATCHES "All")
+ set( MIN_ARCH 370)
+ SET( NV_SM ${CUDA_ARCH_LIST})
+ SET( NV_COMP "")
+ endif()
+
if (NOT MIN_ARCH)
message( FATAL_ERROR "GPU_TARGET must contain one or more of Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, or valid sm_[0-9][0-9]" )
endif()
- set( CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -Xcompiler -fPIC ${NV_SM} ${NV_COMP} ${FORTRAN_CONVENTION} )
+ set( CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -DHAVE_CUBLAS -Xfatbin -compress-all -Xcompiler -fPIC -std=c++11 ${NV_SM} ${NV_COMP} ${FORTRAN_CONVENTION} )
+ MESSAGE(STATUS "CUDA_NVCC_FLAGS: ${CUDA_NVCC_FLAGS}")
#add_definitions( "-DMAGMA_HAVE_CUDA -DMAGMA_CUDA_ARCH_MIN=${MIN_ARCH}" )
set(MAGMA_HAVE_CUDA "1")
set(MAGMA_CUDA_ARCH_MIN "${MIN_ARCH}")
@@ -413,7 +422,7 @@ set_property(CACHE BLA_VENDOR PROPERTY STRINGS
set( LAPACK_LIBRARIES "" CACHE STRING "Libraries for LAPACK and BLAS, to manually override search" )
if (LAPACK_LIBRARIES STREQUAL "")
message( STATUS "Searching for BLAS and LAPACK. To override, set LAPACK_LIBRARIES using ccmake." )
- find_package( LAPACK )
+ # find_package( LAPACK )
# force showing updated LAPACK_LIBRARIES in ccmake / cmake-gui.
set( LAPACK_LIBRARIES ${LAPACK_LIBRARIES} CACHE STRING "Libraries for LAPACK and BLAS, to manually override search" FORCE )
else()
@@ -552,12 +561,12 @@ if (WIN32)
#message( "libmagma_all_f ${libmagma_all_f}" )
# on Windows, Fortran files aren't compiled if listed here...
- cuda_add_library( magma ${libmagma_all_cpp} )
+ cuda_add_library( magma STATIC ${libmagma_all_cpp} OPTIONS --compiler-options "-fPIC")
target_link_libraries( magma
${LAPACK_LIBRARIES}
${CUDA_CUDART_LIBRARY}
${CUDA_CUBLAS_LIBRARIES}
- ${CUDA_cusparse_LIBRARY}
+ # ${CUDA_cusparse_LIBRARY}
)
# no Fortran files at the moment (how to test libmagma_all_f is not empty?),
@@ -575,13 +584,13 @@ if (WIN32)
else()
# Unix doesn't seem to have a problem with mixing C, CUDA, and Fortran files
if (MAGMA_ENABLE_CUDA)
- cuda_add_library( magma ${libmagma_all} )
+ cuda_add_library( magma STATIC ${libmagma_all} OPTIONS --compiler-options "-fPIC")
target_link_libraries( magma
${blas_fix}
${LAPACK_LIBRARIES}
${CUDA_CUDART_LIBRARY}
${CUDA_CUBLAS_LIBRARIES}
- ${CUDA_cusparse_LIBRARY}
+ # ${CUDA_cusparse_LIBRARY}
)
else()
find_package( hipBLAS )
@@ -614,138 +623,139 @@ else()
endif()
endif()
add_custom_target( lib DEPENDS magma )
-
-
-# ----------------------------------------
-# compile lapacktest library
-# If use fortran, compile only Fortran files, not magma_[sdcz]_no_fortran.cpp
-# else, compile only C++ files, not Fortran files
-if (USE_FORTRAN)
- foreach( filename ${liblapacktest_all} )
- if (filename MATCHES "\\.(f|f90|F90)$")
- list( APPEND liblapacktest_all_f ${filename} )
- endif()
- endforeach()
- add_library( lapacktest ${liblapacktest_all_f} )
-else()
- # alternatively, use only C/C++/CUDA files, including magma_[sdcz]_no_fortran.cpp
- foreach( filename ${liblapacktest_all} )
- if (filename MATCHES "\\.(c|cu|cpp)$")
- list( APPEND liblapacktest_all_cpp ${filename} )
- endif()
- endforeach()
- add_library( lapacktest ${liblapacktest_all_cpp} )
-endif()
-target_link_libraries( lapacktest
- ${blas_fix}
- ${LAPACK_LIBRARIES}
-)
-
-
-# ----------------------------------------
-# compile tester library
-add_library( tester ${libtest_all} )
-target_link_libraries( tester
- magma
- lapacktest
- ${blas_fix}
- ${LAPACK_LIBRARIES}
-)
+set_target_properties(magma PROPERTIES POSITION_INDEPENDENT_CODE ON)
+
+
+# # ----------------------------------------
+# # compile lapacktest library
+# # If use fortran, compile only Fortran files, not magma_[sdcz]_no_fortran.cpp
+# # else, compile only C++ files, not Fortran files
+# if (USE_FORTRAN)
+# foreach( filename ${liblapacktest_all} )
+# if (filename MATCHES "\\.(f|f90|F90)$")
+# list( APPEND liblapacktest_all_f ${filename} )
+# endif()
+# endforeach()
+# add_library( lapacktest ${liblapacktest_all_f} )
+# else()
+# # alternatively, use only C/C++/CUDA files, including magma_[sdcz]_no_fortran.cpp
+# foreach( filename ${liblapacktest_all} )
+# if (filename MATCHES "\\.(c|cu|cpp)$")
+# list( APPEND liblapacktest_all_cpp ${filename} )
+# endif()
+# endforeach()
+# add_library( lapacktest ${liblapacktest_all_cpp} )
+# endif()
+# target_link_libraries( lapacktest
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# )
+
+
+# # ----------------------------------------
+# # compile tester library
+# add_library( tester ${libtest_all} )
+# target_link_libraries( tester
+# magma
+# lapacktest
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# )
# ----------------------------------------
# compile MAGMA sparse library
# sparse doesn't have Fortran at the moment, so no need for above shenanigans
-if (MAGMA_ENABLE_CUDA)
- include_directories( sparse/include )
- include_directories( sparse/control )
-else()
- include_directories( sparse_hip/include )
- include_directories( sparse_hip/control )
-endif()
-include_directories( testing )
-
-if (MAGMA_ENABLE_CUDA)
- cuda_add_library( magma_sparse ${libsparse_all} )
- target_link_libraries( magma_sparse
- magma
- ${blas_fix}
- ${LAPACK_LIBRARIES}
- ${CUDA_CUDART_LIBRARY}
- ${CUDA_CUBLAS_LIBRARIES}
- ${CUDA_cusparse_LIBRARY}
- )
-else()
- add_library( magma_sparse ${libsparse_all} )
- target_link_libraries( magma_sparse
- magma
- ${blas_fix}
- ${LAPACK_LIBRARIES}
- hip::device
- roc::hipblas
- roc::hipsparse
- )
-endif()
-add_custom_target( sparse-lib DEPENDS magma_sparse )
-
-
-# ----------------------------------------
-# compile each tester
-
-# save testers to testing/
-# save tester lib files to testing_lib/ to avoid cluttering lib/
-set( CMAKE_RUNTIME_OUTPUT_DIRECTORY testing )
-set( CMAKE_ARCHIVE_OUTPUT_DIRECTORY testing_lib )
-set( CMAKE_LIBRARY_OUTPUT_DIRECTORY testing_lib )
-
-# skip Fortran testers, which require an extra file from CUDA
-foreach( filename ${testing_all} )
- if (filename MATCHES "\\.(c|cu|cpp)$")
- list( APPEND testing_all_cpp ${filename} )
- endif()
-endforeach()
-foreach( TEST ${testing_all_cpp} )
- string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
- string( REGEX REPLACE "testing/" "" EXE ${EXE} )
- #message( "${TEST} --> ${EXE}" )
- add_executable( ${EXE} ${TEST} )
- target_link_libraries( ${EXE} tester lapacktest magma )
- list( APPEND testing ${EXE} )
-endforeach()
-add_custom_target( testing DEPENDS ${testing} )
-
-
-# ----------------------------------------
-# compile each sparse tester
-
-if (MAGMA_ENABLE_CUDA)
- set(SPARSE_TEST_DIR "sparse/testing")
-else()
- set(SPARSE_TEST_DIR "sparse_hip/testing")
-endif()
-
-
-set( CMAKE_RUNTIME_OUTPUT_DIRECTORY "${SPARSE_TEST_DIR}" )
-cmake_policy( SET CMP0037 OLD)
-foreach( TEST ${sparse_testing_all} )
- string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
- string( REGEX REPLACE "${SPARSE_TEST_DIR}/" "" EXE ${EXE} )
- #message( "${TEST} --> ${EXE}" )
- add_executable( ${EXE} ${TEST} )
- target_link_libraries( ${EXE} magma_sparse magma )
- list( APPEND sparse-testing ${EXE} )
-endforeach()
-add_custom_target( sparse-testing DEPENDS ${sparse-testing} )
+# if (MAGMA_ENABLE_CUDA)
+# include_directories( sparse/include )
+# include_directories( sparse/control )
+# else()
+# include_directories( sparse_hip/include )
+# include_directories( sparse_hip/control )
+# endif()
+# include_directories( testing )
+
+# if (MAGMA_ENABLE_CUDA)
+# cuda_add_library( magma_sparse ${libsparse_all} )
+# target_link_libraries( magma_sparse
+# magma
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# ${CUDA_CUDART_LIBRARY}
+# ${CUDA_CUBLAS_LIBRARIES}
+# ${CUDA_cusparse_LIBRARY}
+# )
+# else()
+# add_library( magma_sparse ${libsparse_all} )
+# target_link_libraries( magma_sparse
+# magma
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# hip::device
+# roc::hipblas
+# roc::hipsparse
+# )
+# endif()
+# add_custom_target( sparse-lib DEPENDS magma_sparse )
+
+
+# # ----------------------------------------
+# # compile each tester
+
+# # save testers to testing/
+# # save tester lib files to testing_lib/ to avoid cluttering lib/
+# set( CMAKE_RUNTIME_OUTPUT_DIRECTORY testing )
+# set( CMAKE_ARCHIVE_OUTPUT_DIRECTORY testing_lib )
+# set( CMAKE_LIBRARY_OUTPUT_DIRECTORY testing_lib )
+
+# # skip Fortran testers, which require an extra file from CUDA
+# foreach( filename ${testing_all} )
+# if (filename MATCHES "\\.(c|cu|cpp)$")
+# list( APPEND testing_all_cpp ${filename} )
+# endif()
+# endforeach()
+# foreach( TEST ${testing_all_cpp} )
+# string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
+# string( REGEX REPLACE "testing/" "" EXE ${EXE} )
+# #message( "${TEST} --> ${EXE}" )
+# add_executable( ${EXE} ${TEST} )
+# target_link_libraries( ${EXE} tester lapacktest magma )
+# list( APPEND testing ${EXE} )
+# endforeach()
+# add_custom_target( testing DEPENDS ${testing} )
+
+
+# # ----------------------------------------
+# # compile each sparse tester
+
+# if (MAGMA_ENABLE_CUDA)
+# set(SPARSE_TEST_DIR "sparse/testing")
+# else()
+# set(SPARSE_TEST_DIR "sparse_hip/testing")
+# endif()
+
+
+# set( CMAKE_RUNTIME_OUTPUT_DIRECTORY "${SPARSE_TEST_DIR}" )
+# cmake_policy( SET CMP0037 OLD)
+# foreach( TEST ${sparse_testing_all} )
+# string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
+# string( REGEX REPLACE "${SPARSE_TEST_DIR}/" "" EXE ${EXE} )
+# #message( "${TEST} --> ${EXE}" )
+# add_executable( ${EXE} ${TEST} )
+# target_link_libraries( ${EXE} magma_sparse magma )
+# list( APPEND sparse-testing ${EXE} )
+# endforeach()
+# add_custom_target( sparse-testing DEPENDS ${sparse-testing} )
# ----------------------------------------
# what to install
-install( TARGETS magma magma_sparse ${blas_fix}
+install( TARGETS magma ${blas_fix}
RUNTIME DESTINATION bin
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib )
-file( GLOB headers include/*.h sparse/include/*.h "${CMAKE_BINARY_DIR}/include/*.h" )
+file( GLOB headers include/*.h "${CMAKE_BINARY_DIR}/include/*.h" )
if (USE_FORTRAN)
install( FILES ${headers} ${modules}
DESTINATION include )
@@ -769,9 +779,9 @@ else()
"${blas_fix_lib} ${LAPACK_LIBS} hip::device roc::hipblas roc::hipsparse" )
endif()
set( MAGMA_REQUIRED "" )
-configure_file( "${pkgconfig}.in" "${pkgconfig}" @ONLY )
-install( FILES "${CMAKE_BINARY_DIR}/${pkgconfig}"
- DESTINATION lib/pkgconfig )
+# configure_file( "${pkgconfig}.in" "${pkgconfig}" @ONLY )
+# install( FILES "${CMAKE_BINARY_DIR}/${pkgconfig}"
+# DESTINATION lib/pkgconfig )
# ----------------------------------------
get_directory_property( compile_definitions COMPILE_DEFINITIONS )


@ -0,0 +1,40 @@
diff --git a/control/get_batched_crossover.cpp b/control/get_batched_crossover.cpp
index 4ec57306..912f8608 100644
--- a/control/get_batched_crossover.cpp
+++ b/control/get_batched_crossover.cpp
@@ -119,7 +119,7 @@ void magma_get_spotrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_zgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 64;
- *recnb = 32;
+ *recnb = 16;
return;
}
@@ -127,7 +127,7 @@ void magma_get_zgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_cgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 128;
- *recnb = 32;
+ *recnb = 16;
return;
}
@@ -135,7 +135,7 @@ void magma_get_cgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_dgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 128;
- *recnb = 32;
+ *recnb = 16;
return;
}
@@ -143,7 +143,7 @@ void magma_get_dgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_sgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 128;
- *recnb = 32;
+ *recnb = 16;
return;
}


@ -0,0 +1,15 @@
diff --git a/src/zgetrf_batched.cpp b/src/zgetrf_batched.cpp
index 24a65a90..884d9352 100644
--- a/src/zgetrf_batched.cpp
+++ b/src/zgetrf_batched.cpp
@@ -116,7 +116,9 @@ magma_zgetrf_batched(
return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
}
else{
- return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
+ // magma_cgetrf_batched_smallsq_shfl is broken, therefore let's call noshfl version for arch < 700
+ // return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
+ return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
}
#else
return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );


@ -0,0 +1 @@
6cd83808c6e8bc7a44028e05112b3ab4e579bcc73202ed14733f66661127e213 magma-2.6.1.tar.gz


@ -0,0 +1,20 @@
--- control/thread_queue.cpp 2016-08-30 06:37:49.000000000 -0700
+++ control/thread_queue.cpp 2016-10-10 19:47:28.911580965 -0700
@@ -15,7 +15,7 @@
{
if ( err != 0 ) {
fprintf( stderr, "Error: %s (%d)\n", strerror(err), err );
- throw std::exception();
+ // throw std::exception();
}
}
@@ -172,7 +172,7 @@
check( pthread_mutex_lock( &mutex ));
if ( quit_flag ) {
fprintf( stderr, "Error: push_task() called after quit()\n" );
- throw std::exception();
+ // throw std::exception();
}
q.push( task );
ntask += 1;

21
.ci/manywheel/LICENSE Normal file

@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2016 manylinux
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

28
.ci/manywheel/build.sh Executable file

@ -0,0 +1,28 @@
#!/usr/bin/env bash
set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
case "${GPU_ARCH_TYPE:-BLANK}" in
BLANK)
# Legacy behavior for CircleCI
bash "${SCRIPTPATH}/build_cuda.sh"
;;
cuda)
bash "${SCRIPTPATH}/build_cuda.sh"
;;
rocm)
bash "${SCRIPTPATH}/build_rocm.sh"
;;
cpu | cpu-cxx11-abi | cpu-s390x)
bash "${SCRIPTPATH}/build_cpu.sh"
;;
xpu)
bash "${SCRIPTPATH}/build_xpu.sh"
;;
*)
echo "Un-recognized GPU_ARCH_TYPE '${GPU_ARCH_TYPE}', exiting..."
exit 1
;;
esac


@ -0,0 +1,498 @@
#!/usr/bin/env bash
# meant to be called only from the neighboring build.sh and build_cpu.sh scripts
set -ex
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
source ${SOURCE_DIR}/set_desired_python.sh
if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then
echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"
echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"
exit 1
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
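# Example usage (as done for the package installs below): retry yum install -q -y zip openssl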
PLATFORM="manylinux2014_x86_64"
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# TODO: Remove this once nvidia package repos are back online
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
fi
# We use the package name to test the package by passing this to 'pip install'
# This is the env variable that setup.py uses to name the package. Note that
# pip 'normalizes' the name first by changing all - to _
if [[ -z "$TORCH_PACKAGE_NAME" ]]; then
TORCH_PACKAGE_NAME='torch'
fi
if [[ -z "$TORCH_NO_PYTHON_PACKAGE_NAME" ]]; then
TORCH_NO_PYTHON_PACKAGE_NAME='torch_no_python'
fi
TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"
TORCH_NO_PYTHON_PACKAGE_NAME="$(echo $TORCH_NO_PYTHON_PACKAGE_NAME | tr '-' '_')"
echo "Expecting the built wheels to all be called '$TORCH_PACKAGE_NAME' or '$TORCH_NO_PYTHON_PACKAGE_NAME'"
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
# PYTORCH_BUILD_NUMBER > 1
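# e.g. PYTORCH_BUILD_VERSION=2.6.0 with PYTORCH_BUILD_NUMBER=2 yields version 2.6.0.post2 (version numbers here are illustrative)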
build_version="$PYTORCH_BUILD_VERSION"
build_number="$PYTORCH_BUILD_NUMBER"
if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then
# This will be the *exact* version, since build_number<1
build_version="$OVERRIDE_PACKAGE_VERSION"
build_number=0
fi
if [[ -z "$build_version" ]]; then
build_version=1.0.0
fi
if [[ -z "$build_number" ]]; then
build_number=1
fi
export PYTORCH_BUILD_VERSION=$build_version
export PYTORCH_BUILD_NUMBER=$build_number
export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"
export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"
if [[ -e /opt/openssl ]]; then
export OPENSSL_ROOT_DIR=/opt/openssl
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
mkdir -p /tmp/$WHEELHOUSE_DIR
export PATCHELF_BIN=/usr/local/bin/patchelf
patchelf_version=$($PATCHELF_BIN --version)
echo "patchelf version: " $patchelf_version
if [[ "$patchelf_version" == "patchelf 0.9" ]]; then
echo "Your patchelf version is too old. Please use version >= 0.10."
exit 1
fi
########################################################
# Compile wheels as well as libtorch
#######################################################
if [[ -z "$PYTORCH_ROOT" ]]; then
echo "Need to set PYTORCH_ROOT env variable"
exit 1
fi
pushd "$PYTORCH_ROOT"
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in
cp31*)
retry pip install -q --pre numpy==2.1.0
;;
# Should catch 3.9+
*)
retry pip install -q --pre numpy==2.0.2
;;
esac
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
fi
# This value comes from binary_linux_build.sh (and should only be set to true
# for master / release branches)
BUILD_DEBUG_INFO=${BUILD_DEBUG_INFO:=0}
if [[ $BUILD_DEBUG_INFO == "1" ]]; then
echo "Building wheel and debug info"
else
echo "BUILD_DEBUG_INFO was not set, skipping debug info"
fi
if [[ "$DISABLE_RCCL" = 1 ]]; then
echo "Disabling NCCL/RCCL in pyTorch"
USE_RCCL=0
USE_NCCL=0
USE_KINETO=0
else
USE_RCCL=1
USE_NCCL=1
USE_KINETO=1
fi
echo "Calling setup.py bdist at $(date)"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
fi
echo "Finished setup.py bdist at $(date)"
# Build libtorch packages
if [[ -n "$BUILD_PYTHONLESS" ]]; then
# Now build pythonless libtorch
# Note - just use whichever python we happen to be on
python setup.py clean
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
fi
mkdir -p build
pushd build
echo "Calling tools/build_libtorch.py at $(date)"
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \
python ../tools/build_libtorch.py
echo "Finished tools/build_libtorch.py at $(date)"
popd
mkdir -p libtorch/{lib,bin,include,share}
cp -r build/build/lib libtorch/
# for now, the headers for the libtorch package will just be copied in
# from one of the wheels (this is from when this script built multiple
# wheels at once)
ANY_WHEEL=$(ls /tmp/$WHEELHOUSE_DIR/torch*.whl | head -n1)
unzip -d any_wheel $ANY_WHEEL
if [[ -d any_wheel/torch/include ]]; then
cp -r any_wheel/torch/include libtorch/
else
cp -r any_wheel/torch/lib/include libtorch/
fi
cp -r any_wheel/torch/share/cmake libtorch/share/
rm -rf any_wheel
echo $PYTORCH_BUILD_VERSION > libtorch/build-version
echo "$(pushd $PYTORCH_ROOT && git rev-parse HEAD)" > libtorch/build-hash
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
fi
popd
#######################################################################
# ADD DEPENDENCIES INTO THE WHEEL
#
# auditwheel repair doesn't work correctly and is buggy
# so manually do the work of copying dependency libs and patchelfing
# and fixing RECORDS entries correctly
######################################################################
fname_with_sha256() {
HASH=$(sha256sum $1 | cut -c1-8)
DIRNAME=$(dirname $1)
BASENAME=$(basename $1)
# Do not rename nvrtc-builtins.so as they are dynamically loaded
# by libnvrtc.so
# Similarly don't mangle libcudnn and libcublas library names
if [[ $BASENAME == "libnvrtc-builtins.s"* || $BASENAME == "libcudnn"* || $BASENAME == "libcublas"* ]]; then
echo $1
else
INITNAME=$(echo $BASENAME | cut -f1 -d".")
ENDNAME=$(echo $BASENAME | cut -f 2- -d".")
echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"
fi
}
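# Illustrative example (hypothetical path): "fname_with_sha256 torch/lib/libshm.so" would
# print "torch/lib/libshm-<first 8 sha256 hex chars>.so", while cudnn/cublas/nvrtc-builtins
# paths are echoed back unchanged so their expected library names are preserved.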
fname_without_so_number() {
LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')
echo "$LINKNAME"
}
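# e.g. "fname_without_so_number libfoo.so.1.2.3" prints "libfoo.so" (hypothetical name)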
make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# if this is the RECORD file itself, leave the hash and size fields empty
echo "\"$FPATH\",,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "\"$FPATH\",sha256=$HASH,$FSIZE"
fi
}
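# Each line emitted above follows the wheel RECORD format:
# "path",sha256=<urlsafe-base64 digest without padding>,<size in bytes>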
replace_needed_sofiles() {
find $1 -name '*.so*' | while read sofile; do
origname=$2
patchedname=$3
if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
set +e
origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "patching $sofile entry $origname to $patchedname"
$PATCHELF_BIN --replace-needed $origname $patchedname $sofile
fi
fi
done
}
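# Usage: replace_needed_sofiles <directory> <original soname> <patched soname>;
# rewrites the DT_NEEDED entries of every .so under <directory> that still reference the original name.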
echo 'Built this wheel:'
ls /tmp/$WHEELHOUSE_DIR
mkdir -p "/$WHEELHOUSE_DIR"
mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true
fi
if [[ -n "$BUILD_PYTHONLESS" ]]; then
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
rm -rf /tmp/$LIBTORCH_HOUSE_DIR
fi
rm -rf /tmp/$WHEELHOUSE_DIR
rm -rf /tmp_dir
mkdir /tmp_dir
pushd /tmp_dir
for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.whl /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do
# if the glob didn't match anything
if [[ ! -e $pkg ]]; then
continue
fi
rm -rf tmp
mkdir -p tmp
cd tmp
cp $pkg .
unzip -q $(basename $pkg)
rm -f $(basename $pkg)
if [[ -d torch ]]; then
PREFIX=torch
else
PREFIX=libtorch
fi
if [[ $pkg != *"without-deps"* ]]; then
# copy over needed dependent .so files over and tag them with their hash
patched=()
for filepath in "${DEPS_LIST[@]}"; do
filename=$(basename $filepath)
destpath=$PREFIX/lib/$filename
if [[ "$filepath" != "$destpath" ]]; then
cp $filepath $destpath
fi
# ROCm workaround for roctracer dlopens
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
# Keep the so number for XPU dependencies
elif [[ "$DESIRED_CUDA" == *"xpu"* ]]; then
patchedpath=$destpath
else
patchedpath=$(fname_with_sha256 $destpath)
fi
patchedname=$(basename $patchedpath)
if [[ "$destpath" != "$patchedpath" ]]; then
mv $destpath $patchedpath
fi
patched+=("$patchedname")
echo "Copied $filepath to $patchedpath"
done
echo "patching to fix the so names to the hashed names"
for ((i=0;i<${#DEPS_LIST[@]};++i)); do
replace_needed_sofiles $PREFIX ${DEPS_SONAME[i]} ${patched[i]}
# do the same for caffe2, if it exists
if [[ -d caffe2 ]]; then
replace_needed_sofiles caffe2 ${DEPS_SONAME[i]} ${patched[i]}
fi
done
# copy over needed auxiliary files
for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do
srcpath=${DEPS_AUX_SRCLIST[i]}
dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}
mkdir -p $(dirname $dstpath)
cp $srcpath $dstpath
done
fi
# set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib
find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'}"
$PATCHELF_BIN --set-rpath ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# set RPATH of lib/ files to $ORIGIN
find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to ${LIB_SO_RPATH:-'$ORIGIN'}"
$PATCHELF_BIN --set-rpath ${LIB_SO_RPATH:-'$ORIGIN'} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# Create the manylinux_2_28 tag; this needs to happen before regenerating the RECORD file
if [[ $PLATFORM == "manylinux_2_28_x86_64" && $GPU_ARCH_TYPE != "cpu-s390x" && $GPU_ARCH_TYPE != "xpu" ]]; then
wheel_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/WHEEL/g')
sed -i -e s#linux_x86_64#"${PLATFORM}"# $wheel_file;
fi
# regenerate the RECORD file with new hashes
record_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g')
if [[ -e $record_file ]]; then
echo "Generating new record file $record_file"
: > "$record_file"
# generate records for folders in wheel
find * -type f | while read fname; do
make_wheel_record "$fname" >>"$record_file"
done
fi
if [[ $BUILD_DEBUG_INFO == "1" ]]; then
pushd "$PREFIX/lib"
# Duplicate library into debug lib
cp libtorch_cpu.so libtorch_cpu.so.dbg
# Keep debug symbols on debug lib
strip --only-keep-debug libtorch_cpu.so.dbg
# Remove debug info from release lib
strip --strip-debug libtorch_cpu.so
objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg
# Zip up debug info
mkdir -p /tmp/debug
mv libtorch_cpu.so.dbg /tmp/debug/libtorch_cpu.so.dbg
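# The last 4 bytes of the .gnu_debuglink section are the CRC32 that objcopy embedded above;
# dump the section into a process substitution and format those bytes as hex for the zip name.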
CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch_cpu.so)
pushd /tmp
PKG_NAME=$(basename "$pkg" | sed 's/\.whl$//g')
zip /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip /tmp/debug/libtorch_cpu.so.dbg
cp /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip "$PYTORCH_FINAL_PACKAGE_DIR"
popd
popd
fi
# Rename wheel for Manylinux 2_28
if [[ $PLATFORM == "manylinux_2_28_x86_64" && $GPU_ARCH_TYPE != "cpu-s390x" && $GPU_ARCH_TYPE != "xpu" ]]; then
pkg_name=$(echo $(basename $pkg) | sed -e s#linux_x86_64#"${PLATFORM}"#)
zip -rq $pkg_name $PREFIX*
rm -f $pkg
mv $pkg_name $(dirname $pkg)/$pkg_name
else
# zip up the wheel back
zip -rq $(basename $pkg) $PREFIX*
# remove original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
fi
cd ..
rm -rf tmp
done
# Copy wheels to host machine for persistence before testing
if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
if [[ -n "$BUILD_PYTHONLESS" ]]; then
cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
else
cp /$WHEELHOUSE_DIR/torch*.whl "$PYTORCH_FINAL_PACKAGE_DIR"
fi
fi
# remove stuff before testing
rm -rf /opt/rh
if ls /usr/local/cuda* >/dev/null 2>&1; then
rm -rf /usr/local/cuda*
fi
# Test that all the wheels work
if [[ -z "$BUILD_PYTHONLESS" ]]; then
export OMP_NUM_THREADS=4 # on NUMA machines this takes too long
pushd $PYTORCH_ROOT/test
# Install the wheel for this Python version
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true
fi
pip uninstall -y "$TORCH_PACKAGE_NAME"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
fi
pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
# Print info on the libraries installed in this wheel
# Rather than adjusting the find command to skip non-library files with an embedded *.so* in their name,
# we add || true to the ldd command, since this is only for reporting purposes.
installed_libraries=($(find "$pydir/lib/python${py_majmin}/site-packages/torch/" -name '*.so*'))
echo "The wheel installed all of the libraries: ${installed_libraries[@]}"
for installed_lib in "${installed_libraries[@]}"; do
ldd "$installed_lib" || true
done
# Run the tests
echo "$(date) :: Running tests"
pushd "$PYTORCH_ROOT"
LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
"${PYTORCH_ROOT}/.ci/pytorch/run_tests.sh" manywheel "${py_majmin}" "$DESIRED_CUDA"
popd
echo "$(date) :: Finished tests"
fi

.ci/manywheel/build_cpu.sh Executable file

@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -ex
export TH_BINARY_BUILD=1
export USE_CUDA=0
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
WHEELHOUSE_DIR="wheelhousecpu"
LIBTORCH_HOUSE_DIR="libtorch_housecpu"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhousecpu"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_housecpu"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
if [[ "$(uname -m)" == "s390x" ]]; then
LIBGOMP_PATH="/usr/lib/s390x-linux-gnu/libgomp.so.1"
else
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
)
DEPS_SONAME=(
"libgomp.so.1"
)
rm -rf /usr/local/cuda*
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source ${SOURCE_DIR}/${BUILD_SCRIPT}

.ci/manywheel/build_cuda.sh Normal file

@ -0,0 +1,292 @@
#!/usr/bin/env bash
set -ex
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P ))"
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
export NCCL_ROOT_DIR=/usr/local/cuda
export TH_BINARY_BUILD=1
export USE_STATIC_CUDNN=1
export USE_STATIC_NCCL=1
export ATEN_STATIC_CUDA=1
export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # don't install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Determine CUDA version and architectures to build for
#
# NOTE: We should first check `DESIRED_CUDA` when determining `CUDA_VERSION`,
# because in some cases a single Docker image can have multiple CUDA versions
# on it, and `nvcc --version` might not show the CUDA version we want.
if [[ -n "$DESIRED_CUDA" ]]; then
# If the DESIRED_CUDA already matches the format that we expect
if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then
CUDA_VERSION=${DESIRED_CUDA}
else
# cu90, cu92, cu100, cu101
if [[ ${#DESIRED_CUDA} -eq 4 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"
elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"
fi
fi
echo "Using CUDA $CUDA_VERSION as determined by DESIRED_CUDA"
else
CUDA_VERSION=$(nvcc --version|grep release|cut -f5 -d" "|cut -f1 -d",")
echo "CUDA $CUDA_VERSION Detected"
fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
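# e.g. CUDA_VERSION=12.6 gives cuda_version_nodot=126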
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.6)
if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then
TORCH_CUDA_ARCH_LIST="9.0"
else
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"
fi
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then
TORCH_CUDA_ARCH_LIST="9.0"
else
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
fi
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.1)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
*)
echo "unknown cuda version $CUDA_VERSION"
exit 1
;;
esac
export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}
echo "${TORCH_CUDA_ARCH_LIST}"
# Package directories
WHEELHOUSE_DIR="wheelhouse$cuda_version_nodot"
LIBTORCH_HOUSE_DIR="libtorch_house$cuda_version_nodot"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$cuda_version_nodot"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$cuda_version_nodot"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
)
DEPS_SONAME=(
"libgomp.so.1"
)
# CUDA 11.8 has to ship libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available on PyPI
if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.12"
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
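# CUDA_RPATHS is now a single ':'-joined string of $ORIGIN-relative entries pointing at the
# lib/ dirs of the pip-installed nvidia-* packages (e.g. site-packages/nvidia/cublas/lib).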
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.11"
"/usr/local/cuda/lib64/libcublasLt.so.11"
"/usr/local/cuda/lib64/libcudart.so.11.0"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.11.2" # this is not a mistake, it links to more specific cuda version
"/usr/local/cuda/lib64/libnvrtc-builtins.so.11.8"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.11"
"libcublasLt.so.11"
"libcudart.so.11.0"
"libnvToolsExt.so.1"
"libnvrtc.so.11.2"
"libnvrtc-builtins.so.11.8"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
else
echo "Unknown cuda version $CUDA_VERSION"
exit 1
fi
# run_tests.sh requires DESIRED_CUDA to know what tests to exclude
export DESIRED_CUDA="$cuda_version_nodot"
# Switch `/usr/local/cuda` to the desired CUDA version
rm -rf /usr/local/cuda || true
ln -s "/usr/local/cuda-${CUDA_VERSION}" /usr/local/cuda
# Switch `/usr/local/magma` to the desired CUDA version
rm -rf /usr/local/magma || true
ln -s /usr/local/cuda-${CUDA_VERSION}/magma /usr/local/magma
export CUDA_VERSION=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev) # 10.0.130
export CUDA_VERSION_SHORT=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev | cut -f1,2 -d".") # 10.0
export CUDNN_VERSION=$(ls /usr/local/cuda/lib64/libcudnn.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev)
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source $SCRIPTPATH/${BUILD_SCRIPT}


@ -0,0 +1,353 @@
#!/usr/bin/env bash
# meant to be called only from the neighboring build.sh and build_cpu.sh scripts
set -e -o pipefail
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
# Require only one python installation
if [[ -z "$DESIRED_PYTHON" ]]; then
echo "Need to set DESIRED_PYTHON env variable"
exit 1
fi
if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then
echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"
echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"
exit 1
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
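# e.g. "retry yum install -q -y zip openssl" makes up to four further attempts,
# backing off 1, 2, 4 and 8 seconds between tries.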
# TODO move this into the Docker images
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# TODO: Remove this once nvidia package repos are back online
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
fi
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
# PYTORCH_BUILD_NUMBER > 1
build_version="$PYTORCH_BUILD_VERSION"
build_number="$PYTORCH_BUILD_NUMBER"
if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then
# This will be the *exact* version, since build_number<1
build_version="$OVERRIDE_PACKAGE_VERSION"
build_number=0
fi
if [[ -z "$build_version" ]]; then
build_version=1.0.0
fi
if [[ -z "$build_number" ]]; then
build_number=1
fi
export PYTORCH_BUILD_VERSION=$build_version
export PYTORCH_BUILD_NUMBER=$build_number
export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"
export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"
# set OPENSSL_ROOT_DIR=/opt/openssl if it exists
if [[ -e /opt/openssl ]]; then
export OPENSSL_ROOT_DIR=/opt/openssl
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
# If given a python version like 3.6m or 2.7mu, convert this to the format we
# expect. The binary CI jobs pass in python versions like this; they also only
# ever pass one python version, so we assume that DESIRED_PYTHON is not a list
# in this case
if [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then
python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"
DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"
fi
pydir="/opt/python/$DESIRED_PYTHON"
export PATH="$pydir/bin:$PATH"
export PATCHELF_BIN=/usr/local/bin/patchelf
patchelf_version=`$PATCHELF_BIN --version`
echo "patchelf version: " $patchelf_version
if [[ "$patchelf_version" == "patchelf 0.9" ]]; then
echo "Your patchelf version is too old. Please use version >= 0.10."
exit 1
fi
########################################################
# Compile wheels as well as libtorch
#######################################################
if [[ -z "$PYTORCH_ROOT" ]]; then
echo "Need to set PYTORCH_ROOT env variable"
exit 1
fi
pushd "$PYTORCH_ROOT"
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
# TODO remove this work-around once pytorch sources are updated
export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr
fi
echo "Calling setup.py install at $(date)"
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
fi
(
set -x
mkdir -p build
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \
# TODO: Remove this flag once https://github.com/pytorch/pytorch/issues/55952 is closed
CFLAGS='-Wno-deprecated-declarations' \
BUILD_LIBTORCH_CPU_WITH_DEBUG=1 \
python setup.py install
mkdir -p libtorch/{lib,bin,include,share}
# Make debug folder separate so it doesn't get zipped up with the rest of
# libtorch
mkdir debug
# Copy over all lib files
cp -rv build/lib/* libtorch/lib/
cp -rv build/lib*/torch/lib/* libtorch/lib/
# Copy over all include files
cp -rv build/include/* libtorch/include/
cp -rv build/lib*/torch/include/* libtorch/include/
# Copy over all of the cmake files
cp -rv build/lib*/torch/share/* libtorch/share/
# Split libtorch into debug / release version
cp libtorch/lib/libtorch_cpu.so libtorch/lib/libtorch_cpu.so.dbg
# Keep debug symbols on debug lib
strip --only-keep-debug libtorch/lib/libtorch_cpu.so.dbg
# Remove debug info from release lib
strip --strip-debug libtorch/lib/libtorch_cpu.so
# Add a debug link from the release lib to the debug lib (debuggers will then
# search for symbols in a file called libtorch_cpu.so.dbg in some
# predetermined locations) and embed a CRC32 of the debug library into the .so
cd libtorch/lib
objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg
cd ../..
# Move the debug symbols to their own directory so they don't get processed /
# zipped with all the other libraries
mv libtorch/lib/libtorch_cpu.so.dbg debug/libtorch_cpu.so.dbg
echo "${PYTORCH_BUILD_VERSION}" > libtorch/build-version
echo "$(pushd $PYTORCH_ROOT && git rev-parse HEAD)" > libtorch/build-hash
)
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
(
set -x
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
# objcopy embedded a CRC32 into libtorch_cpu.so above, so add that to the zip name here
CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch/lib/libtorch_cpu.so)
# Zip debug symbols
zip /tmp/$LIBTORCH_HOUSE_DIR/debug-libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION-$CRC32.zip debug/libtorch_cpu.so.dbg
# Zip and copy libtorch
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
)
popd
#######################################################################
# ADD DEPENDENCIES INTO THE WHEEL
#
# auditwheel repair doesn't work correctly and is buggy
# so manually do the work of copying dependency libs and patchelfing
# and fixing RECORDS entries correctly
######################################################################
fname_with_sha256() {
HASH=$(sha256sum $1 | cut -c1-8)
DIRNAME=$(dirname $1)
BASENAME=$(basename $1)
if [[ $BASENAME == "libnvrtc-builtins.so" || $BASENAME == "libcudnn"* ]]; then
echo $1
else
INITNAME=$(echo $BASENAME | cut -f1 -d".")
ENDNAME=$(echo $BASENAME | cut -f 2- -d".")
echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"
fi
}
fname_without_so_number() {
LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')
echo "$LINKNAME"
}
make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# if this is the RECORD file itself, leave the hash and size fields empty
echo "\"$FPATH\",,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "\"$FPATH\",sha256=$HASH,$FSIZE"
fi
}
echo 'Built this package:'
(
set -x
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
rm -rf /tmp/$LIBTORCH_HOUSE_DIR
)
TMP_DIR=$(mktemp -d)
trap "rm -rf ${TMP_DIR}" EXIT
pushd "${TMP_DIR}"
for pkg in /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do
# if the glob didn't match anything
if [[ ! -e $pkg ]]; then
continue
fi
rm -rf tmp
mkdir -p tmp
cd tmp
cp $pkg .
unzip -q $(basename $pkg)
rm -f $(basename $pkg)
PREFIX=libtorch
if [[ $pkg != *"without-deps"* ]]; then
# copy over needed dependent .so files over and tag them with their hash
patched=()
for filepath in "${DEPS_LIST[@]}"; do
filename=$(basename $filepath)
destpath=$PREFIX/lib/$filename
if [[ "$filepath" != "$destpath" ]]; then
cp $filepath $destpath
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
else
patchedpath=$(fname_with_sha256 $destpath)
fi
patchedname=$(basename $patchedpath)
if [[ "$destpath" != "$patchedpath" ]]; then
mv $destpath $patchedpath
fi
patched+=("$patchedname")
echo "Copied $filepath to $patchedpath"
done
echo "patching to fix the so names to the hashed names"
for ((i=0;i<${#DEPS_LIST[@]};++i)); do
find $PREFIX -name '*.so*' | while read sofile; do
origname=${DEPS_SONAME[i]}
patchedname=${patched[i]}
if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
set +e
origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "patching $sofile entry $origname to $patchedname"
$PATCHELF_BIN --replace-needed $origname $patchedname $sofile
fi
fi
done
done
# copy over needed auxiliary files
for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do
srcpath=${DEPS_AUX_SRCLIST[i]}
dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}
mkdir -p $(dirname $dstpath)
cp $srcpath $dstpath
done
fi
# set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib
find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to " '$ORIGIN:$ORIGIN/lib'
$PATCHELF_BIN --set-rpath '$ORIGIN:$ORIGIN/lib' $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# set RPATH of lib/ files to $ORIGIN
find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to " '$ORIGIN'
$PATCHELF_BIN --set-rpath '$ORIGIN' $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# regenerate the RECORD file with new hashes
record_file=`echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g'`
if [[ -e $record_file ]]; then
echo "Generating new record file $record_file"
rm -f $record_file
# generate records for folders in wheel
find * -type f | while read fname; do
echo $(make_wheel_record $fname) >>$record_file
done
fi
# zip up the wheel back
zip -rq $(basename $pkg) $PREFIX*
# replace original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
cd ..
rm -rf tmp
done
# Copy wheels to host machine for persistence before testing
if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
cp /$LIBTORCH_HOUSE_DIR/debug-libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
fi

.ci/manywheel/build_rocm.sh Executable file

@ -0,0 +1,291 @@
#!/usr/bin/env bash
set -ex
export ROCM_HOME=/opt/rocm
export MAGMA_HOME=$ROCM_HOME/magma
# TODO: libtorch_cpu.so is broken when building with Debug info
export BUILD_DEBUG_INFO=0
# TODO Are these all used/needed?
export TH_BINARY_BUILD=1
export USE_STATIC_CUDNN=1
export USE_STATIC_NCCL=1
export ATEN_STATIC_CUDA=1
export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # don't install test binaries into site-packages
# Set RPATH instead of RUNPATH when using patchelf to avoid LD_LIBRARY_PATH override
export FORCE_RPATH="--force-rpath"
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Determine ROCm version and architectures to build for
#
# NOTE: We should first check `DESIRED_CUDA` when determining `ROCM_VERSION`
if [[ -n "$DESIRED_CUDA" ]]; then
if ! echo "${DESIRED_CUDA}"| grep "^rocm" >/dev/null 2>/dev/null; then
export DESIRED_CUDA="rocm${DESIRED_CUDA}"
fi
# rocm3.7, rocm3.5.1
ROCM_VERSION="$DESIRED_CUDA"
echo "Using $ROCM_VERSION as determined by DESIRED_CUDA"
else
echo "Must set DESIRED_CUDA"
exit 1
fi
# Package directories
WHEELHOUSE_DIR="wheelhouse$ROCM_VERSION"
LIBTORCH_HOUSE_DIR="libtorch_house$ROCM_VERSION"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$ROCM_VERSION"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$ROCM_VERSION"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
# To make version comparison easier, create an integer representation.
ROCM_VERSION_CLEAN=$(echo ${ROCM_VERSION} | sed s/rocm//)
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION_CLEAN})
IFS="$save_IFS"
if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=0
elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
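# e.g. rocm6.2.1 -> ROCM_INT=60201, rocm6.1 -> ROCM_INT=60100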
# Required ROCm libraries
ROCM_SO_FILES=(
"libMIOpen.so"
"libamdhip64.so"
"libhipblas.so"
"libhipfft.so"
"libhiprand.so"
"libhipsolver.so"
"libhipsparse.so"
"libhsa-runtime64.so"
"libamd_comgr.so"
"libmagma.so"
"librccl.so"
"librocblas.so"
"librocfft.so"
"librocm_smi64.so"
"librocrand.so"
"librocsolver.so"
"librocsparse.so"
"libroctracer64.so"
"libroctx64.so"
"libhipblaslt.so"
"libhiprtc.so"
)
if [[ $ROCM_INT -ge 60100 ]]; then
ROCM_SO_FILES+=("librocprofiler-register.so")
fi
if [[ $ROCM_INT -ge 60200 ]]; then
ROCM_SO_FILES+=("librocm-core.so")
fi
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* || "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
LIBNUMA_PATH="/usr/lib64/libnuma.so.1"
LIBELF_PATH="/usr/lib64/libelf.so.1"
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
else
LIBTINFO_PATH="/usr/lib64/libtinfo.so.6"
fi
LIBDRM_PATH="/opt/amdgpu/lib64/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/opt/amdgpu/lib64/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBSUITESPARSE_CONFIG_PATH="/lib64/libsuitesparseconfig.so.4"
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBCHOLMOD_PATH="/lib64/libcholmod.so.2"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.3"
else
LIBCHOLMOD_PATH="/lib64/libcholmod.so.3"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.5"
fi
# Below libs are direct dependencies of libcholmod
LIBAMD_PATH="/lib64/libamd.so.2"
LIBCAMD_PATH="/lib64/libcamd.so.2"
LIBCCOLAMD_PATH="/lib64/libccolamd.so.2"
LIBCOLAMD_PATH="/lib64/libcolamd.so.2"
LIBSATLAS_PATH="/lib64/atlas/libsatlas.so.3"
# Below libs are direct dependencies of libsatlas
LIBQUADMATH_PATH="/lib64/libquadmath.so.0"
fi
MAYBE_LIB64=lib64
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
LIBNUMA_PATH="/usr/lib/x86_64-linux-gnu/libnuma.so.1"
LIBELF_PATH="/usr/lib/x86_64-linux-gnu/libelf.so.1"
if [[ $ROCM_INT -ge 50300 ]]; then
LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.6"
else
LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.5"
fi
LIBDRM_PATH="/usr/lib/x86_64-linux-gnu/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/usr/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBCHOLMOD_PATH="/lib/x86_64-linux-gnu/libcholmod.so.3"
# Below libs are direct dependencies of libcholmod
LIBSUITESPARSE_CONFIG_PATH="/lib/x86_64-linux-gnu/libsuitesparseconfig.so.5"
LIBAMD_PATH="/lib/x86_64-linux-gnu/libamd.so.2"
LIBCAMD_PATH="/lib/x86_64-linux-gnu/libcamd.so.2"
LIBCCOLAMD_PATH="/lib/x86_64-linux-gnu/libccolamd.so.2"
LIBCOLAMD_PATH="/lib/x86_64-linux-gnu/libcolamd.so.2"
LIBMETIS_PATH="/lib/x86_64-linux-gnu/libmetis.so.5"
LIBLAPACK_PATH="/lib/x86_64-linux-gnu/liblapack.so.3"
LIBBLAS_PATH="/lib/x86_64-linux-gnu/libblas.so.3"
# Below libs are direct dependencies of libblas
LIBGFORTRAN_PATH="/lib/x86_64-linux-gnu/libgfortran.so.5"
LIBQUADMATH_PATH="/lib/x86_64-linux-gnu/libquadmath.so.0"
fi
MAYBE_LIB64=lib
fi
OS_SO_PATHS=($LIBGOMP_PATH $LIBNUMA_PATH\
$LIBELF_PATH $LIBTINFO_PATH\
$LIBDRM_PATH $LIBDRM_AMDGPU_PATH\
$LIBSUITESPARSE_CONFIG_PATH\
$LIBCHOLMOD_PATH $LIBAMD_PATH\
$LIBCAMD_PATH $LIBCCOLAMD_PATH\
$LIBCOLAMD_PATH $LIBSATLAS_PATH\
$LIBGFORTRAN_PATH $LIBQUADMATH_PATH\
$LIBMETIS_PATH $LIBLAPACK_PATH\
$LIBBLAS_PATH)
OS_SO_FILES=()
for lib in "${OS_SO_PATHS[@]}"
do
file_name="${lib##*/}" # Substring removal of path to get filename
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
# FIXME: Temporary until https://github.com/pytorch/pytorch/pull/137443 lands
# Install AOTriton
if [ -e ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt ]; then
cp -a ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt aotriton_version.txt
bash ${PYTORCH_ROOT}/.ci/docker/common/install_aotriton.sh ${ROCM_HOME} && rm aotriton_version.txt
export AOTRITON_INSTALLED_PREFIX=${ROCM_HOME}/aotriton
ROCM_SO_FILES+=("libaotriton_v2.so")
fi
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace the ';'-separated arch list with '|' so it can be used as an extended grep pattern
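# e.g. PYTORCH_ROCM_ARCH="gfx90a;gfx942" would give ARCH="gfx90a|gfx942" (illustrative arch list)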
ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library
HIPBLASLT_LIB_DST=lib/hipblaslt/library
ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
# ROCm library files
ROCM_SO_PATHS=()
for lib in "${ROCM_SO_FILES[@]}"
do
file_path=($(find $ROCM_HOME/lib/ -name "$lib")) # First search in lib
if [[ -z $file_path ]]; then
if [ -d "$ROCM_HOME/lib64/" ]; then
file_path=($(find $ROCM_HOME/lib64/ -name "$lib")) # Then search in lib64
fi
fi
if [[ -z $file_path ]]; then
file_path=($(find $ROCM_HOME/ -name "$lib")) # Then search in ROCM_HOME
fi
if [[ -z $file_path ]]; then
echo "Error: Library file $lib is not found." >&2
exit 1
fi
ROCM_SO_PATHS[${#ROCM_SO_PATHS[@]}]="$file_path" # Append lib to array
done
DEPS_LIST=(
${ROCM_SO_PATHS[*]}
${OS_SO_PATHS[*]}
)
DEPS_SONAME=(
${ROCM_SO_FILES[*]}
${OS_SO_FILES[*]}
)
DEPS_AUX_SRCLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_SRC/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_SRC/}"
"/opt/amdgpu/share/libdrm/amdgpu.ids"
)
DEPS_AUX_DSTLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_DST/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_DST/}"
"share/libdrm/amdgpu.ids"
)
# MIOpen library files
MIOPEN_SHARE_SRC=$ROCM_HOME/share/miopen/db
MIOPEN_SHARE_DST=share/miopen/db
MIOPEN_SHARE_FILES=($(ls $MIOPEN_SHARE_SRC | grep -E $ARCH))
DEPS_AUX_SRCLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_DST/})
# RCCL library files
RCCL_SHARE_SRC=$ROCM_HOME/share/rccl/msccl-algorithms
RCCL_SHARE_DST=share/rccl/msccl-algorithms
RCCL_SHARE_FILES=($(ls $RCCL_SHARE_SRC))
DEPS_AUX_SRCLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_DST/})
# PyTorch 2.6+ (AOTriton 0.8b+)
# AKS = "AOTriton Kernel Storage", a file format to store GPU kernels compactly
if (( $(echo "${PYTORCH_VERSION} 2.6" | awk '{print ($1 >= $2)}') )); then
LIBAOTRITON_DIR=$(find "$ROCM_HOME/lib/" -name "libaotriton_v2.so" -printf '%h\n')
if [[ -z ${LIBAOTRITON_DIR} ]]; then
LIBAOTRITON_DIR=$(find "$ROCM_HOME/" -name "libaotriton_v2.so" -printf '%h\n')
fi
AKS_FILES=($(find "${LIBAOTRITON_DIR}/aotriton.images" -type f -name '*.aks?' -printf '%P\n'))
AKS_SRC="${LIBAOTRITON_DIR}/aotriton.images"
AKS_DST="lib/aotriton.images"
DEPS_AUX_SRCLIST+=(${AKS_FILES[@]/#/${AKS_SRC}/})
DEPS_AUX_DSTLIST+=(${AKS_FILES[@]/#/${AKS_DST}/})
fi
echo "PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH}"
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source $SCRIPTPATH/${BUILD_SCRIPT}

.ci/manywheel/build_xpu.sh Executable file

@ -0,0 +1,108 @@
#!/usr/bin/env bash
set -ex
export TH_BINARY_BUILD=1
export USE_CUDA=0
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
source /opt/intel/oneapi/umf/latest/env/vars.sh
export USE_STATIC_MKL=1
WHEELHOUSE_DIR="wheelhousexpu"
LIBTORCH_HOUSE_DIR="libtorch_housexpu"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhousexpu"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_housexpu"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
if [[ "$(uname -m)" == "s390x" ]]; then
LIBGOMP_PATH="/usr/lib/s390x-linux-gnu/libgomp.so.1"
else
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
"/opt/intel/oneapi/compiler/latest/lib/libOpenCL.so.1"
)
DEPS_SONAME=(
"libgomp.so.1"
"libOpenCL.so.1"
)
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with xpu support package libs."
DEPS_LIST+=(
"/opt/intel/oneapi/compiler/latest/lib/libsycl.so.8"
"/opt/intel/oneapi/compiler/latest/lib/libur_loader.so.0"
"/opt/intel/oneapi/compiler/latest/lib/libur_adapter_level_zero.so.0"
"/opt/intel/oneapi/compiler/latest/lib/libur_adapter_opencl.so.0"
"/opt/intel/oneapi/compiler/latest/lib/libsvml.so"
"/opt/intel/oneapi/compiler/latest/lib/libirng.so"
"/opt/intel/oneapi/compiler/latest/lib/libimf.so"
"/opt/intel/oneapi/compiler/latest/lib/libintlc.so.5"
"/opt/intel/oneapi/pti/latest/lib/libpti_view.so.0.10"
"/opt/intel/oneapi/umf/latest/lib/libumf.so.0"
"/opt/intel/oneapi/tcm/latest/lib/libhwloc.so.15"
)
DEPS_SONAME+=(
"libsycl.so.8"
"libur_loader.so.0"
"libur_adapter_level_zero.so.0"
"libur_adapter_opencl.so.0"
"libsvml.so"
"libirng.so"
"libimf.so"
"libintlc.so.5"
"libpti_view.so.0.10"
"libumf.so.0"
"libhwloc.so.15"
)
else
echo "Using xpu runtime libs from pypi."
XPU_RPATHS=(
'$ORIGIN/../../../..'
)
XPU_RPATHS=$(IFS=: ; echo "${XPU_RPATHS[*]}")
export C_SO_RPATH=$XPU_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$XPU_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
fi
rm -rf /usr/local/cuda*
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source ${SOURCE_DIR}/${BUILD_SCRIPT}


@ -0,0 +1,30 @@
#!/usr/bin/env bash
# Require only one python installation
if [[ -z "$DESIRED_PYTHON" ]]; then
echo "Need to set DESIRED_PYTHON env variable"
exit 1
fi
# If given a python version like 3.6m or 2.7mu, convert this to the format we
# expect. The binary CI jobs pass in python versions like this; they also only
# ever pass one python version, so we assume that DESIRED_PYTHON is not a list
# in this case
if [[ -n "$DESIRED_PYTHON" && $DESIRED_PYTHON =~ ([0-9].[0-9]+)t ]]; then
python_digits="$(echo $DESIRED_PYTHON | tr -cd [:digit:])"
py_majmin="${DESIRED_PYTHON}"
DESIRED_PYTHON="cp${python_digits}-cp${python_digits}t"
elif [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then
python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"
DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"
if [[ ${python_nodot} -ge 310 ]]; then
py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:2}"
else
py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:1}"
fi
fi
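# e.g. DESIRED_PYTHON=3.11 becomes cp311-cp311 with py_majmin=3.11, while the
# 't'-suffixed DESIRED_PYTHON=3.13t becomes cp313-cp313t with py_majmin=3.13t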
pydir="/opt/python/$DESIRED_PYTHON"
export DESIRED_PYTHON_BIN_DIR="${pydir}/bin"
export PATH="$DESIRED_PYTHON_BIN_DIR:$PATH"
echo "Will build for Python version: ${DESIRED_PYTHON}"

.ci/manywheel/test_wheel.sh Executable file

@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -e
yum install -y wget git
rm -rf /usr/local/cuda*
# Install Anaconda
if ! ls /py
then
echo "Miniconda needs to be installed"
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p /py
else
echo "Miniconda is already installed"
fi
export PATH="/py/bin:$PATH"
# Anaconda token
if ls /remote/token
then
source /remote/token
fi
conda install -y conda-build anaconda-client


@ -1,6 +1,6 @@
#!/bin/bash
set -ex
set -ex -o pipefail
# Required environment variable: $BUILD_ENVIRONMENT
# (This is set by default in the Docker images we build, so you don't
@ -49,13 +49,8 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
fi
# Enable LLVM dependency for TensorExpr testing
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export USE_LLVM=/opt/rocm/llvm
export LLVM_DIR=/opt/rocm/llvm/lib/cmake/llvm
else
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
fi
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then
# To build test_edge_op_registration
@ -92,7 +87,7 @@ else
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
if [[ "$ANACONDA_PYTHON_VERSION" = "3.12" || "$ANACONDA_PYTHON_VERSION" = "3.13" ]]; then
export CMAKE_LIBRARY_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/"
export CMAKE_INCLUDE_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/include/"
fi
@ -183,7 +178,7 @@ fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]] || [[ "$BUILD_ENVIRONMENT" == *gcc7* ]]; } && which sccache > /dev/null; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; } && which sccache > /dev/null; then
export MAX_JOBS=$(($(nproc) - 1))
fi
fi
@ -196,7 +191,7 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ "$TORCH_CUDA_ARCH_LIST" == *"8.6"* || "$TORCH_CUDA_ARCH_LIST" == *"8.0"* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
@ -208,10 +203,12 @@ if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export LDSHARED="clang --shared"
export USE_CUDA=0
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export USE_CUDA=1
fi
export USE_ASAN=1
export UBSAN_FLAGS="-fno-sanitize-recover=all;-fno-sanitize=float-divide-by-zero;-fno-sanitize=float-cast-overflow"
export REL_WITH_DEB_INFO=1
export UBSAN_FLAGS="-fno-sanitize-recover=all"
unset USE_LLVM
fi
@ -223,10 +220,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
export USE_PRECOMPILED_HEADERS=1
fi
if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then
export USE_GLOO_WITH_OPENSSL=ON
fi
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
@ -237,7 +230,7 @@ fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -254,10 +247,9 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
set -e -o pipefail
get_bazel
install_sccache_nvcc_for_bazel
# Leave 1 CPU free and use only up to 80% of memory to reduce the chance of crashing
# the runner
@ -286,14 +278,13 @@ else
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install --pre numpy==2.0.2
python -mpip install numpy==2.0.2
fi
WERROR=1 python setup.py clean
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 python setup.py bdist_wheel --cmake
python3 tools/packaging/split_wheel.py bdist_wheel
else
WERROR=1 python setup.py bdist_wheel
fi
@ -345,11 +336,11 @@ else
CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"
CUSTOM_OP_TEST="$PWD/test/custom_operator"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"
mkdir -p "$CUSTOM_OP_BUILD"
pushd "$CUSTOM_OP_BUILD"
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -359,10 +350,10 @@ else
JIT_HOOK_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/jit-hook-build"
JIT_HOOK_TEST="$PWD/test/jit_hooks"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"
mkdir -p "$JIT_HOOK_BUILD"
pushd "$JIT_HOOK_BUILD"
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -374,7 +365,7 @@ else
python --version
mkdir -p "$CUSTOM_BACKEND_BUILD"
pushd "$CUSTOM_BACKEND_BUILD"
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -404,9 +395,7 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
# don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build
python tools/stats/export_test_times.py
fi
# snadampal: skipping it till sccache support added for aarch64
# https://github.com/pytorch/pytorch/issues/121559
if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then
# don't do this for bazel or s390x as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
print_sccache_stats
fi

.ci/pytorch/check_binary.sh Executable file

@ -0,0 +1,394 @@
#!/bin/bash
# shellcheck disable=SC2086,SC2006,SC2207,SC2076,SC2155,SC2046,SC1091,SC2143
# TODO: Re-enable shellchecks above
set -eux -o pipefail
# This script checks the following things on binaries
# 1. The gcc abi matches DESIRED_DEVTOOLSET
# 2. MacOS binaries do not link against OpenBLAS
# 3. There are no protobuf symbols of any sort anywhere (turned off, because
# this is currently not true)
# 4. Standard Python imports work
# 5. MKL is available everywhere except for MacOS wheels
# 6. XNNPACK is available everywhere except for MacOS wheels
# 7. CUDA is setup correctly and does not hang
# 8. Magma is available for CUDA builds
# 9. CuDNN is available for CUDA builds
#
# This script needs the env variables DESIRED_PYTHON, DESIRED_CUDA,
# DESIRED_DEVTOOLSET and PACKAGE_TYPE
#
# This script expects PyTorch to be installed into the active Python (the
# Python returned by `which python`). Or, if this is testing a libtorch
# Pythonless binary, then it expects to be in the root folder of the unzipped
# libtorch package.
if [[ -z ${DESIRED_PYTHON:-} ]]; then
export DESIRED_PYTHON=${MATRIX_PYTHON_VERSION:-}
fi
if [[ -z ${DESIRED_CUDA:-} ]]; then
export DESIRED_CUDA=${MATRIX_DESIRED_CUDA:-}
fi
if [[ -z ${DESIRED_DEVTOOLSET:-} ]]; then
export DESIRED_DEVTOOLSET=${MATRIX_DESIRED_DEVTOOLSET:-}
fi
if [[ -z ${PACKAGE_TYPE:-} ]]; then
export PACKAGE_TYPE=${MATRIX_PACKAGE_TYPE:-}
fi
# The install root depends on both the package type and the os
# All MacOS packages use conda, even for the wheel packages.
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
# NOTE: Only $PWD works on both CentOS and Ubuntu
export install_root="$PWD"
else
if [[ $DESIRED_PYTHON =~ ([0-9].[0-9]+)t ]]; then
# For python versions with a trailing 't' (e.g. 3.13t), keep the original maj.min string
py_dot="$DESIRED_PYTHON"
elif [[ $DESIRED_PYTHON =~ ([0-9].[0-9]+) ]]; then
# Strip everything but major.minor from DESIRED_PYTHON version
py_dot="${BASH_REMATCH[0]}"
else
echo "Unexpected ${DESIRED_PYTHON} format"
exit 1
fi
export install_root="$(dirname $(which python))/../lib/python${py_dot}/site-packages/torch/"
fi
###############################################################################
# Setup XPU ENV
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
set +u
# Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
fi
###############################################################################
# Check GCC ABI
###############################################################################
# NOTE [ Building libtorch with old vs. new gcc ABI ]
#
# Packages built with one version of ABI could not be linked against by client
# C++ libraries that were compiled using the other version of ABI. Since both
# gcc ABIs are still common in the wild, we need to support both ABIs. Currently:
#
# - All the nightlies built on CentOS 7 + devtoolset7 use the old gcc ABI.
# - All the nightlies built on Ubuntu 16.04 + gcc 5.4 use the new gcc ABI.
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' ]]; then
function is_expected() {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$DESIRED_CUDA" == *"rocm"* ]]; then
if [[ "$1" -gt 0 || "$1" == "ON " ]]; then
echo 1
fi
else
if [[ -z "$1" || "$1" == 0 || "$1" == "OFF" ]]; then
echo 1
fi
fi
}
# First we check that the env var in TorchConfig.cmake is correct
# We search for D_GLIBCXX_USE_CXX11_ABI=1 in torch/TorchConfig.cmake
torch_config="${install_root}/share/cmake/Torch/TorchConfig.cmake"
if [[ ! -f "$torch_config" ]]; then
echo "No TorchConfig.cmake found!"
ls -lah "$install_root/share/cmake/Torch"
exit 1
fi
echo "Checking the TorchConfig.cmake"
cat "$torch_config"
# The sed call below:
#   -n   don't print lines by default (only print the line we want)
#   -e   execute the following expression
#   s/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p
#        match: any characters, D_GLIBCXX_USE_CXX11_ABI=, exactly one character
#        (the capture group), a quote, then any characters; replace the matched
#        line with the capture group and print it.
# Note the exactly one single character after the '='. If the variable is not
# set, the '=' is immediately followed by a '"', the line fails the match and
# nothing is printed; this is what we want. Otherwise it captures the 0 or 1
# after the '='.
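# As an illustration (assuming a typical TorchConfig.cmake), a line such as
#   set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=1")
# makes the sed below print "1", while ..._USE_CXX11_ABI="" prints nothing.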
actual_gcc_abi="$(sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p' < "$torch_config")"
if [[ "$(is_expected "$actual_gcc_abi")" != 1 ]]; then
echo "gcc ABI $actual_gcc_abi not as expected."
exit 1
fi
# We also check that there are [not] cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"
python3 "$(dirname ${BASH_SOURCE[0]})/smoke_test/check_binary_symbols.py"
echo "cxx11 symbols seem to be in order"
fi # if on Darwin
###############################################################################
# Check for no OpenBLAS
# TODO Check for no Protobuf symbols (not finished)
# Print *all* runtime dependencies
###############################################################################
# We have to loop through all shared libraries for this
if [[ "$(uname)" == 'Darwin' ]]; then
all_dylibs=($(find "$install_root" -name '*.dylib'))
for dylib in "${all_dylibs[@]}"; do
echo "All dependencies of $dylib are $(otool -L $dylib) with rpath $(otool -l $dylib | grep LC_RPATH -A2)"
# Check that OpenBlas is not linked to on Macs
echo "Checking the OpenBLAS is not linked to"
if [[ -n "$(otool -L $dylib | grep -i openblas)" ]]; then
echo "ERROR: Found openblas as a dependency of $dylib"
echo "Full dependencies is: $(otool -L $dylib)"
exit 1
fi
# Check for protobuf symbols
#proto_symbols="$(nm $dylib | grep protobuf)" || true
#if [[ -n "$proto_symbols" ]]; then
# echo "ERROR: Detected protobuf symbols in $dylib"
# echo "Symbols are $proto_symbols"
# exit 1
#fi
done
else
all_libs=($(find "$install_root" -name '*.so'))
for lib in "${all_libs[@]}"; do
echo "All dependencies of $lib are $(ldd $lib) with runpath $(objdump -p $lib | grep RUNPATH)"
# Check for protobuf symbols
#proto_symbols=$(nm $lib | grep protobuf) || true
#if [[ -n "$proto_symbols" ]]; then
# echo "ERROR: Detected protobuf symbols in $lib"
# echo "Symbols are $proto_symbols"
# exit 1
#fi
done
fi
setup_link_flags () {
REF_LIB="-Wl,-R${install_root}/lib"
if [[ "$(uname)" == 'Darwin' ]]; then
REF_LIB="-Wl,-rpath ${install_root}/lib"
fi
ADDITIONAL_LINKER_FLAGS=""
if [[ "$(uname)" == 'Linux' ]]; then
ADDITIONAL_LINKER_FLAGS="-Wl,--no-as-needed"
fi
C10_LINK_FLAGS=""
if [ -f "${install_root}/lib/libc10.so" ] || [ -f "${install_root}/lib/libc10.dylib" ]; then
C10_LINK_FLAGS="-lc10"
fi
TORCH_CPU_LINK_FLAGS=""
if [ -f "${install_root}/lib/libtorch_cpu.so" ] || [ -f "${install_root}/lib/libtorch_cpu.dylib" ]; then
TORCH_CPU_LINK_FLAGS="-ltorch_cpu"
fi
TORCH_CUDA_LINK_FLAGS=""
if [ -f "${install_root}/lib/libtorch_cuda.so" ] || [ -f "${install_root}/lib/libtorch_cuda.dylib" ]; then
TORCH_CUDA_LINK_FLAGS="-ltorch_cuda"
elif [ -f "${install_root}/lib/libtorch_cuda_cpp.so" ] && [ -f "${install_root}/lib/libtorch_cuda_cpp.so" ] || \
[ -f "${install_root}/lib/libtorch_cuda_cu.dylib" ] && [ -f "${install_root}/lib/libtorch_cuda_cu.dylib" ]; then
TORCH_CUDA_LINK_FLAGS="-ltorch_cuda_cpp -ltorch_cuda_cu"
fi
}
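# After setup_link_flags runs, the link commands assembled below look roughly
# like this on Linux (illustrative only):
#   g++ example.cpp -I$install_root/include -L$install_root/lib \
#     -Wl,-R$install_root/lib -Wl,--no-as-needed -ltorch -ltorch_cpu -lc10 -o example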
TEST_CODE_DIR="$(dirname $(realpath ${BASH_SOURCE[0]}))/test_example_code"
build_and_run_example_cpp () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=1
else
GLIBCXX_USE_CXX11_ABI=0
fi
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
./$1
}
build_example_cpp_with_incorrect_abi () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=0
else
GLIBCXX_USE_CXX11_ABI=1
fi
set +e
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "Building example with incorrect ABI didn't throw error. Aborting."
exit 1
else
echo "Building example with incorrect ABI throws expected error. Proceeding."
fi
}
###############################################################################
# Check simple Python/C++ calls
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
# NS: Set LD_LIBRARY_PATH for CUDA builds, but perhaps it should be removed
if [[ "$DESIRED_CUDA" == "cu"* ]]; then
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
fi
build_and_run_example_cpp simple-torch-test
# `_GLIBCXX_USE_CXX11_ABI` is always ignored by gcc in devtoolset7, so we test
# the expected failure case for Ubuntu 16.04 + gcc 5.4 only.
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
build_example_cpp_with_incorrect_abi simple-torch-test
fi
else
pushd /tmp
python -c 'import torch'
popd
fi
###############################################################################
# Check torch.git_version
###############################################################################
if [[ "$PACKAGE_TYPE" != 'libtorch' ]]; then
pushd /tmp
python -c 'import torch; assert torch.version.git_version != "Unknown"'
python -c 'import torch; assert torch.version.git_version is not None'
popd
fi
###############################################################################
# Check for MKL
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
echo "Checking that MKL is available"
build_and_run_example_cpp check-torch-mkl
elif [[ "$(uname -m)" != "arm64" && "$(uname -m)" != "s390x" ]]; then
if [[ "$(uname)" != 'Darwin' || "$PACKAGE_TYPE" != *wheel ]]; then
if [[ "$(uname -m)" == "aarch64" ]]; then
echo "Checking that MKLDNN is available on aarch64"
pushd /tmp
python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)'
popd
else
echo "Checking that MKL is available"
pushd /tmp
python -c 'import torch; exit(0 if torch.backends.mkl.is_available() else 1)'
popd
fi
fi
fi
###############################################################################
# Check for XNNPACK
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
echo "Checking that XNNPACK is available"
build_and_run_example_cpp check-torch-xnnpack
else
if [[ "$(uname)" != 'Darwin' || "$PACKAGE_TYPE" != *wheel ]] && [[ "$(uname -m)" != "s390x" ]]; then
echo "Checking that XNNPACK is available"
pushd /tmp
python -c 'import torch.backends.xnnpack; exit(0 if torch.backends.xnnpack.enabled else 1)'
popd
fi
fi
###############################################################################
# Check CUDA configured correctly
###############################################################################
# Skip these for Windows machines without GPUs
if [[ "$OSTYPE" == "msys" ]]; then
GPUS=$(wmic path win32_VideoController get name)
if [[ ! "$GPUS" == *NVIDIA* ]]; then
echo "Skip CUDA tests for machines without a Nvidia GPU card"
exit 0
fi
fi
# Test that CUDA builds are setup correctly
if [[ "$DESIRED_CUDA" != 'cpu' && "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'cpu-cxx11-abi' && "$DESIRED_CUDA" != *"rocm"* && "$(uname -m)" != "s390x" ]]; then
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
build_and_run_example_cpp check-torch-cuda
else
pushd /tmp
echo "Checking that CUDA archs are setup correctly"
timeout 20 python -c 'import torch; torch.randn([3,5]).cuda()'
# These have to run after CUDA is initialized
echo "Checking that magma is available"
python -c 'import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)'
echo "Checking that CuDNN is available"
python -c 'import torch; exit(0 if torch.backends.cudnn.is_available() else 1)'
# Validate that the build is free of the linker regressions reported in https://github.com/pytorch/pytorch/issues/57744
echo "Checking that exception handling works"
python -c "import torch; from unittest import TestCase;TestCase().assertRaises(RuntimeError, lambda:torch.eye(7, 7, device='cuda:7'))"
echo "Checking that basic RNN works"
python ${TEST_CODE_DIR}/rnn_smoke.py
echo "Checking that basic CNN works"
python "${TEST_CODE_DIR}/cnn_smoke.py"
echo "Test that linalg works"
python -c "import torch;x=torch.rand(3,3,device='cuda');print(torch.linalg.svd(torch.mm(x.t(), x)))"
popd
fi # if libtorch
fi # if cuda
##########################
# Run parts of smoke tests
##########################
if [[ "$PACKAGE_TYPE" != 'libtorch' ]]; then
pushd "$(dirname ${BASH_SOURCE[0]})/smoke_test"
python -c "from smoke_test import test_linalg; test_linalg()"
if [[ "$DESIRED_CUDA" == *cuda* ]]; then
python -c "from smoke_test import test_linalg; test_linalg('cuda')"
fi
popd
fi
###############################################################################
# Check PyTorch supports TCP_TLS gloo transport
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" != 'libtorch' ]]; then
GLOO_CHECK="import torch.distributed as dist
try:
dist.init_process_group('gloo', rank=0, world_size=1)
except RuntimeError as e:
print(e)
"
RESULT=`GLOO_DEVICE_TRANSPORT=TCP_TLS MASTER_ADDR=localhost MASTER_PORT=63945 python -c "$GLOO_CHECK"`
GLOO_TRANSPORT_IS_NOT_SUPPORTED='gloo transport is not supported'
if [[ "$RESULT" =~ "$GLOO_TRANSPORT_IS_NOT_SUPPORTED" ]]; then
echo "PyTorch doesn't support TLS_TCP transport, please build with USE_GLOO_WITH_OPENSSL=1"
exit 1
fi
fi
###############################################################################
# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries
###############################################################################
if [[ "$(uname)" == 'Linux' && ("$PACKAGE_TYPE" == 'conda' || "$PACKAGE_TYPE" == 'manywheel')]]; then
pushd /tmp
python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"
popd
fi

View File

@@ -6,6 +6,12 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
# Save the absolute path in case later we chdir (as occurs in the gpu perf test)
script_dir="$( cd "$(dirname "${BASH_SOURCE[0]}")" || exit ; pwd -P )"
if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
# This is really weird, but newer sccache somehow produces a broken binary
# see https://github.com/pytorch/pytorch/issues/139188
sudo mv /opt/cache/bin/sccache-0.2.14a /opt/cache/bin/sccache
fi
if which sccache > /dev/null; then
# Save sccache logs to file
sccache --stop-server > /dev/null 2>&1 || true

View File

@@ -3,7 +3,7 @@
# Common setup for all Jenkins scripts
# shellcheck source=./common_utils.sh
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
set -ex
set -ex -o pipefail
# Required environment variables:
# $BUILD_ENVIRONMENT (should be set by your Docker image)

View File

@@ -81,14 +81,15 @@ function pip_install_whl() {
function pip_install() {
# retry 3 times
# old versions of pip don't have the "--progress-bar" flag
pip install --progress-bar off "$@" || pip install --progress-bar off "$@" || pip install --progress-bar off "$@" ||\
pip install "$@" || pip install "$@" || pip install "$@"
pip_install_pkg="python3 -m pip install --progress-bar off"
${pip_install_pkg} "$@" || \
${pip_install_pkg} "$@" || \
${pip_install_pkg} "$@"
}
function pip_uninstall() {
# uninstall 2 times
pip uninstall -y "$@" || pip uninstall -y "$@"
pip3 uninstall -y "$@" || pip3 uninstall -y "$@"
}
function get_exit_code() {
@@ -104,32 +105,12 @@ function get_bazel() {
# version of Bazelisk to fetch the platform specific version of
# Bazel to use from .bazelversion.
retry curl --location --output tools/bazel \
https://raw.githubusercontent.com/bazelbuild/bazelisk/v1.16.0/bazelisk.py
https://raw.githubusercontent.com/bazelbuild/bazelisk/v1.23.0/bazelisk.py
shasum --algorithm=1 --check \
<(echo 'd4369c3d293814d3188019c9f7527a948972d9f8 tools/bazel')
<(echo '01df9cf7f08dd80d83979ed0d0666a99349ae93c tools/bazel')
chmod u+x tools/bazel
}
# This function is bazel specific because of the bug
# in the bazel that requires some special paths massaging
# as a workaround. See
# https://github.com/bazelbuild/bazel/issues/10167
function install_sccache_nvcc_for_bazel() {
sudo mv /usr/local/cuda/bin/nvcc /usr/local/cuda/bin/nvcc-real
# Write the `/usr/local/cuda/bin/nvcc`
cat << EOF | sudo tee /usr/local/cuda/bin/nvcc
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache /usr/local/cuda/bin/nvcc "\$@"
else
exec external/local_cuda/cuda/bin/nvcc-real "\$@"
fi
EOF
sudo chmod +x /usr/local/cuda/bin/nvcc
}
function install_monkeytype {
# Install MonkeyType
pip_install MonkeyType
@@ -179,7 +160,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.25"
pip_install --user "tlparse==0.3.30"
PATH="$(python -m site --user-base)/bin:$PATH"
}
@@ -191,9 +172,22 @@ function install_torchrec_and_fbgemm() {
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it
# seems to be an sccache-related issue
if [[ "$IS_A100_RUNNER" == "1" ]]; then
unset CMAKE_CUDA_COMPILER_LAUNCHER
sudo mv /opt/cache/bin /opt/cache/bin-backup
fi
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
if [[ "$IS_A100_RUNNER" == "1" ]]; then
export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
sudo mv /opt/cache/bin-backup /opt/cache/bin
fi
}
function clone_pytorch_xla() {
@@ -227,6 +221,12 @@ function checkout_install_torchbench() {
popd
}
function install_torchao() {
local commit
commit=$(get_pinned_commit torchao)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/ao.git@${commit}"
}
function print_sccache_stats() {
echo 'PyTorch Build Statistics'
sccache --show-stats

View File

@@ -40,7 +40,7 @@ echo "Building PyTorch C++ API docs..."
rm -rf cppdocs
git clone https://github.com/pytorch/cppdocs
set -ex
set -ex -o pipefail
# Generate ATen files
pushd "${pt_checkout}"

View File

@@ -1,4 +1,4 @@
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
from tempfile import mkdtemp
from cryptography import x509
@@ -42,11 +42,10 @@ def create_cert(path, C, ST, L, O, key):
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
+ timedelta(days=10)
datetime.now(timezone.utc) + timedelta(days=10)
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
@@ -88,11 +87,10 @@ def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
.issuer_name(ca_cert.subject)
.public_key(csr_cert.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
+ timedelta(days=10)
datetime.now(timezone.utc) + timedelta(days=10)
# Sign our certificate with our private key
)
.sign(private_ca_key, hashes.SHA256())

View File

@@ -5,7 +5,7 @@ pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex
set -ex -o pipefail
version=${DOCS_VERSION:-nightly}
echo "version: $version"

View File

@@ -6,7 +6,7 @@
# return the same thing, e.g. checks for rocm, CUDA, and changing the path
# where sccache is installed, and not changing /etc/environment.
set -ex
set -ex -o pipefail
install_binary() {
echo "Downloading sccache binary from S3 repo"

View File

@@ -1,4 +1,5 @@
#!/bin/bash
set -x
# shellcheck disable=SC2034
# shellcheck source=./macos-common.sh
@@ -9,15 +10,13 @@ if [[ -n "$CONDA_ENV" ]]; then
export PATH="$CONDA_ENV/bin":$PATH
fi
# Test that OpenMP is enabled for non-arm64 build
if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
# Test that OpenMP is enabled
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
@@ -27,8 +26,9 @@ setup_test_python() {
echo "Ninja version: $(ninja --version)"
echo "Python version: $(which python) ($(python --version))"
# Increase default limit on open file handles from 256 to 1024
ulimit -n 1024
# Set the limit on open file handles to 16384
# might help with intermittent compiler test failures
ulimit -n 16384
}
test_python_all() {
@@ -149,9 +149,146 @@ test_jit_hooks() {
assert_git_not_dirty
}
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
pushd torchvision
git fetch
git checkout "$(cat ../.github/ci_commit_pins/vision.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
popd
pushd torchaudio
git fetch
git checkout "$(cat ../.github/ci_commit_pins/audio.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2119,SC2120
checkout_install_torchbench
}
conda_benchmark_deps() {
conda install -y astunparse numpy scipy ninja pyyaml setuptools cmake typing-extensions requests protobuf numba cython scikit-learn
conda install -y -c conda-forge librosa
}
test_torchbench_perf() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
echo "Setup complete, launching torchbench training performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
echo "Launching torchbench inference performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
echo "Pytorch benchmark on mps device completed"
}
test_torchbench_smoketest() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
echo "Setup complete, launching torchbench training performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
done
echo "Launching torchbench inference performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
done
echo "Pytorch benchmark on mps device completed"
}
test_hf_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
torchbench_setup_macos
echo "Launching HuggingFace training perf run"
python "$(pwd)"/benchmarks/dynamo/huggingface.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/hf_training.csv
echo "Launching HuggingFace inference perf run"
python "$(pwd)"/benchmarks/dynamo/huggingface.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/hf_inference.csv
echo "HuggingFace benchmark on mps device completed"
}
test_timm_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
torchbench_setup_macos
echo "Launching timm training perf run"
python "$(pwd)"/benchmarks/dynamo/timm_models.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/timm_training.csv
echo "Launching timm inference perf run"
python "$(pwd)"/benchmarks/dynamo/timm_models.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/timm_inference.csv
echo "timm benchmark on mps device completed"
}
install_tlparse
if [[ $NUM_TEST_SHARDS -gt 1 ]]; then
if [[ $TEST_CONFIG == *"perf_all"* ]]; then
test_torchbench_perf
test_hf_perf
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_torchbench"* ]]; then
test_torchbench_perf
elif [[ $TEST_CONFIG == *"perf_hf"* ]]; then
test_hf_perf
elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_libtorch

View File

@@ -8,55 +8,62 @@
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Testing pytorch"
time python test/run_test.py --include test_cuda_multigpu test_cuda_primary_ctx --verbose
# When adding more tests, please use HUD to see which shard is shorter
if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then
# FSDP tests
for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done
fi
# Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015
# python tools/download_mnist.py --quiet -d test/cpp/api/mnist
# OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" build/bin/test_api
time python test/run_test.py --verbose -i distributed/test_c10d_common
time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
# FSDP tests
for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done
# ShardedTensor tests
time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint
time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint
time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec
time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then
time python test/run_test.py --include test_cuda_multigpu test_cuda_primary_ctx --verbose
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015
# python tools/download_mnist.py --quiet -d test/cpp/api/mnist
# OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" build/bin/test_api
time python test/run_test.py --verbose -i distributed/test_c10d_common
time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
# DTensor tests
time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile
# ShardedTensor tests
time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint
time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint
time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec
time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
# DeviceMesh test
time python test/run_test.py --verbose -i distributed/test_device_mesh
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# DTensor tests
time python test/run_test.py --verbose -i distributed/tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/tensor/test_dtensor_compile
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# DeviceMesh test
time python test/run_test.py --verbose -i distributed/test_device_mesh
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu
time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu
time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
fi
assert_git_not_dirty

View File

@@ -7,7 +7,7 @@ source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "python_doc_push_script.sh: Invoked with $*"
set -ex
set -ex -o pipefail
# for statements like ${1:-${DOCS_INSTALL_PATH:-docs/}}
# the order of operations goes:
@@ -63,7 +63,7 @@ build_docs () {
echo "(tried to echo the WARNINGS above the ==== line)"
echo =========================
fi
set -ex
set -ex -o pipefail
return $code
}

436
.ci/pytorch/run_tests.sh Executable file
View File

@@ -0,0 +1,436 @@
#!/bin/bash
# shellcheck disable=SC2086,SC2048,SC2068,SC2145,SC2034,SC2207,SC2143
# TODO: Re-enable shellchecks above
set -eux -o pipefail
# Essentially runs pytorch/test/run_test.py, but keeps track of which tests to
# skip in a centralized place.
#
# TODO Except for a few tests, this entire file is a giant TODO. Why are these
# tests failing?
# TODO deal with Windows
# This script expects to be in the pytorch root folder
if [[ ! -d 'test' || ! -f 'test/run_test.py' ]]; then
echo "run_tests.sh expects to be run from the Pytorch root directory " \
"but I'm actually in $(pwd)"
exit 2
fi
# Allow master skip of all tests
if [[ -n "${SKIP_ALL_TESTS:-}" ]]; then
exit 0
fi
# If given specific test params then just run those
if [[ -n "${RUN_TEST_PARAMS:-}" ]]; then
echo "$(date) :: Calling user-command $(pwd)/test/run_test.py ${RUN_TEST_PARAMS[@]}"
python test/run_test.py ${RUN_TEST_PARAMS[@]}
exit 0
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
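# Example: `retry pip install -q hypothesis` makes up to five attempts in total,
# sleeping 1, 2, 4 and 8 seconds between consecutive attempts.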
# Parameters
##############################################################################
if [[ "$#" != 3 ]]; then
if [[ -z "${DESIRED_PYTHON:-}" || -z "${DESIRED_CUDA:-}" || -z "${PACKAGE_TYPE:-}" ]]; then
echo "USAGE: run_tests.sh PACKAGE_TYPE DESIRED_PYTHON DESIRED_CUDA"
echo "The env variable PACKAGE_TYPE must be set to 'conda' or 'manywheel' or 'libtorch'"
echo "The env variable DESIRED_PYTHON must be set like '2.7mu' or '3.6m' etc"
echo "The env variable DESIRED_CUDA must be set like 'cpu' or 'cu80' etc"
exit 1
fi
package_type="$PACKAGE_TYPE"
py_ver="$DESIRED_PYTHON"
cuda_ver="$DESIRED_CUDA"
else
package_type="$1"
py_ver="$2"
cuda_ver="$3"
fi
if [[ "$cuda_ver" == 'cpu-cxx11-abi' ]]; then
cuda_ver="cpu"
fi
# cu80, cu90, cu100, cpu
if [[ ${#cuda_ver} -eq 4 ]]; then
cuda_ver_majmin="${cuda_ver:2:1}.${cuda_ver:3:1}"
elif [[ ${#cuda_ver} -eq 5 ]]; then
cuda_ver_majmin="${cuda_ver:2:2}.${cuda_ver:4:1}"
fi
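# e.g. cuda_ver=cu90 -> cuda_ver_majmin=9.0, cuda_ver=cu100 -> cuda_ver_majmin=10.0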
NUMPY_PACKAGE=""
if [[ ${py_ver} == "3.10" ]]; then
PROTOBUF_PACKAGE="protobuf>=3.17.2"
NUMPY_PACKAGE="numpy>=1.21.2"
else
PROTOBUF_PACKAGE="protobuf=3.14.0"
fi
# Environment initialization
if [[ "$(uname)" == Darwin ]]; then
# Install the testing dependencies
retry conda install -yq future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
else
retry pip install -qr requirements.txt || true
retry pip install -q hypothesis protobuf pytest setuptools || true
numpy_ver=1.15
case "$(python --version 2>&1)" in
*2* | *3.5* | *3.6*)
numpy_ver=1.11
;;
esac
retry pip install -q "numpy==${numpy_ver}" || true
fi
echo "Testing with:"
pip freeze
conda list || true
##############################################################################
# Smoke tests
##############################################################################
# TODO use check_binary.sh, which requires making sure it runs on Windows
pushd /
echo "Smoke testing imports"
python -c 'import torch'
# Test that MKL is there
if [[ "$(uname)" == 'Darwin' && "$package_type" == *wheel ]]; then
echo 'Not checking for MKL on Darwin wheel packages'
else
echo "Checking that MKL is available"
python -c 'import torch; exit(0 if torch.backends.mkl.is_available() else 1)'
fi
if [[ "$OSTYPE" == "msys" ]]; then
GPUS=$(wmic path win32_VideoController get name)
if [[ ! "$GPUS" == *NVIDIA* ]]; then
echo "Skip CUDA tests for machines without a Nvidia GPU card"
exit 0
fi
fi
# Test that the version number is consistent during building and testing
if [[ "$PYTORCH_BUILD_NUMBER" -gt 1 ]]; then
expected_version="${PYTORCH_BUILD_VERSION}.post${PYTORCH_BUILD_NUMBER}"
else
expected_version="${PYTORCH_BUILD_VERSION}"
fi
echo "Checking that we are testing the package that is just built"
python -c "import torch; exit(0 if torch.__version__ == '$expected_version' else 1)"
# Test that CUDA builds are setup correctly
if [[ "$cuda_ver" != 'cpu' ]]; then
cuda_installed=1
nvidia-smi || cuda_installed=0
if [[ "$cuda_installed" == 0 ]]; then
echo "Skip CUDA tests for machines without a Nvidia GPU card"
else
# Test CUDA archs
echo "Checking that CUDA archs are setup correctly"
timeout 20 python -c 'import torch; torch.randn([3,5]).cuda()'
# These have to run after CUDA is initialized
echo "Checking that magma is available"
python -c 'import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)'
echo "Checking that CuDNN is available"
python -c 'import torch; exit(0 if torch.backends.cudnn.is_available() else 1)'
fi
fi
# Check that OpenBlas is not linked to on MacOS
if [[ "$(uname)" == 'Darwin' ]]; then
echo "Checking the OpenBLAS is not linked to"
all_dylibs=($(find "$(python -c "import site; print(site.getsitepackages()[0])")"/torch -name '*.dylib'))
for dylib in "${all_dylibs[@]}"; do
if [[ -n "$(otool -L $dylib | grep -i openblas)" ]]; then
echo "Found openblas as a dependency of $dylib"
echo "Full dependencies is: $(otool -L $dylib)"
exit 1
fi
done
echo "Checking that OpenMP is available"
python -c "import torch; exit(0 if torch.backends.openmp.is_available() else 1)"
fi
popd
# TODO re-enable the other tests after the nightlies are moved to CI. This is
# because the binaries keep breaking, often from additional tests, that aren't
# real problems. Once these are on circleci and a smoke-binary-build is added
# to PRs then this should stop happening and these can be re-enabled.
echo "Not running unit tests. Hopefully these problems are caught by CI"
exit 0
##############################################################################
# Running unit tests (except not right now)
##############################################################################
echo "$(date) :: Starting tests for $package_type package for python$py_ver and $cuda_ver"
# We keep track of the exact tests to skip, as otherwise we would hardly be
# running any tests. But because of issues working with pytest vs. normal
# python tests, and because of special snowflake tests in test/run_test.py,
# we also take special care of those.
tests_to_skip=()
#
# Entire file exclusions
##############################################################################
entire_file_exclusions=("-x")
# cpp_extensions doesn't work with pytest, so we exclude it from the pytest run
# here and then manually run it later. Note that this is only because this
# entire_file_exclusions flag is only passed to the pytest run
entire_file_exclusions+=("cpp_extensions")
# TODO temporary line to fix next days nightlies, but should be removed when
# issue is fixed
entire_file_exclusions+=('type_info')
if [[ "$cuda_ver" == 'cpu' ]]; then
# test/test_cuda.py exits early if the installed torch is not built with
# CUDA, but the exit doesn't work when running with pytest, so pytest will
# still try to run all the CUDA tests and then fail
entire_file_exclusions+=("cuda")
entire_file_exclusions+=("nccl")
fi
if [[ "$(uname)" == 'Darwin' || "$OSTYPE" == "msys" ]]; then
# pytest on Mac doesn't like the exits in these files
entire_file_exclusions+=('c10d')
entire_file_exclusions+=('distributed')
# pytest doesn't mind the exit but fails the tests. On Mac we run this
# later without pytest
entire_file_exclusions+=('thd_distributed')
fi
#
# Universal flaky tests
##############################################################################
# RendezvousEnvTest sometimes hangs forever
# Otherwise it will fail on CUDA with
# Traceback (most recent call last):
# File "test_c10d.py", line 179, in test_common_errors
# next(gen)
# AssertionError: ValueError not raised
tests_to_skip+=('RendezvousEnvTest and test_common_errors')
# This hung forever once on conda_3.5_cu92
tests_to_skip+=('TestTorch and test_sum_dim')
# test_trace_warn isn't actually flaky, but it doesn't work with pytest so we
# just skip it
tests_to_skip+=('TestJit and test_trace_warn')
#
# Python specific flaky tests
##############################################################################
# test_dataloader.py:721: AssertionError
# looks like a timeout, but interestingly only appears on python 3
if [[ "$py_ver" == 3* ]]; then
tests_to_skip+=('TestDataLoader and test_proper_exit')
fi
#
# CUDA flaky tests, all package types
##############################################################################
if [[ "$cuda_ver" != 'cpu' ]]; then
#
# DistributedDataParallelTest
# All of these seem to fail
tests_to_skip+=('DistributedDataParallelTest')
#
# RendezvousEnvTest
# Traceback (most recent call last):
# File "test_c10d.py", line 201, in test_nominal
# store0, rank0, size0 = next(gen0)
# File "/opt/python/cp36-cp36m/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 131, in _env_rendezvous_handler
# store = TCPStore(master_addr, master_port, start_daemon)
# RuntimeError: Address already in use
tests_to_skip+=('RendezvousEnvTest and test_nominal')
#
# TestCppExtension
#
# Traceback (most recent call last):
# File "test_cpp_extensions.py", line 134, in test_jit_cudnn_extension
# with_cuda=True)
# File "/opt/python/cp35-cp35m/lib/python3.5/site-packages/torch/utils/cpp_extension.py", line 552, in load
# with_cuda)
# File "/opt/python/cp35-cp35m/lib/python3.5/site-packages/torch/utils/cpp_extension.py", line 729, in _jit_compile
# return _import_module_from_library(name, build_directory)
# File "/opt/python/cp35-cp35m/lib/python3.5/site-packages/torch/utils/cpp_extension.py", line 867, in _import_module_from_library
# return imp.load_module(module_name, file, path, description)
# File "/opt/python/cp35-cp35m/lib/python3.5/imp.py", line 243, in load_module
# return load_dynamic(name, filename, file)
# File "/opt/python/cp35-cp35m/lib/python3.5/imp.py", line 343, in load_dynamic
# return _load(spec)
# File "<frozen importlib._bootstrap>", line 693, in _load
# File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
# File "<frozen importlib._bootstrap>", line 577, in module_from_spec
# File "<frozen importlib._bootstrap_external>", line 938, in create_module
# File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
# ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory
tests_to_skip+=('TestCppExtension and test_jit_cudnn_extension')
#
# TestCuda
#
# 3.7_cu80
# RuntimeError: CUDA error: out of memory
tests_to_skip+=('TestCuda and test_arithmetic_large_tensor')
# 3.7_cu80
# RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch-nightly_1538097262541/work/aten/src/THC/THCTensorCopy.cu:205
tests_to_skip+=('TestCuda and test_autogpu')
#
# TestDistBackend
#
# Traceback (most recent call last):
# File "test_thd_distributed.py", line 1046, in wrapper
# self._join_and_reduce(fn)
# File "test_thd_distributed.py", line 1108, in _join_and_reduce
# self.assertEqual(p.exitcode, first_process.exitcode)
# File "/pytorch/test/common.py", line 399, in assertEqual
# super(TestCase, self).assertEqual(x, y, message)
# AssertionError: None != 77 :
tests_to_skip+=('TestDistBackend and test_all_gather_group')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_max')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_min')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_sum')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_product')
tests_to_skip+=('TestDistBackend and test_barrier_group')
tests_to_skip+=('TestDistBackend and test_broadcast_group')
# Traceback (most recent call last):
# File "test_thd_distributed.py", line 1046, in wrapper
# self._join_and_reduce(fn)
# File "test_thd_distributed.py", line 1108, in _join_and_reduce
# self.assertEqual(p.exitcode, first_process.exitcode)
# File "/pytorch/test/common.py", line 397, in assertEqual
# super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
# AssertionError: 12 not less than or equal to 1e-05
tests_to_skip+=('TestDistBackend and test_barrier')
# Traceback (most recent call last):
# File "test_distributed.py", line 1267, in wrapper
# self._join_and_reduce(fn)
# File "test_distributed.py", line 1350, in _join_and_reduce
# self.assertEqual(p.exitcode, first_process.exitcode)
# File "/pytorch/test/common.py", line 399, in assertEqual
# super(TestCase, self).assertEqual(x, y, message)
# AssertionError: None != 1
tests_to_skip+=('TestDistBackend and test_broadcast')
# Memory leak very similar to all the conda ones below, but appears on manywheel
# 3.6m_cu80
# AssertionError: 1605632 not less than or equal to 1e-05 : __main__.TestEndToEndHybridFrontendModels.test_vae_cuda leaked 1605632 bytes CUDA memory on device 0
tests_to_skip+=('TestEndToEndHybridFrontendModels and test_vae_cuda')
# ________________________ TestNN.test_embedding_bag_cuda ________________________
#
# self = <test_nn.TestNN testMethod=test_embedding_bag_cuda>
# dtype = torch.float32
#
# @unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
# @repeat_test_for_types(ALL_TENSORTYPES)
# @skipIfRocm
# def test_embedding_bag_cuda(self, dtype=torch.float):
# self._test_EmbeddingBag(True, 'sum', False, dtype)
# self._test_EmbeddingBag(True, 'mean', False, dtype)
# self._test_EmbeddingBag(True, 'max', False, dtype)
# if dtype != torch.half:
# # torch.cuda.sparse.HalfTensor is not enabled.
# self._test_EmbeddingBag(True, 'sum', True, dtype)
# > self._test_EmbeddingBag(True, 'mean', True, dtype)
#
# test_nn.py:2144:
# _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# test_nn.py:2062: in _test_EmbeddingBag
# _test_vs_Embedding(N, D, B, L)
# test_nn.py:2059: in _test_vs_Embedding
# self.assertEqual(es_weight_grad, e.weight.grad, needed_prec)
# common.py:373: in assertEqual
# assertTensorsEqual(x, y)
# common.py:365: in assertTensorsEqual
# self.assertLessEqual(max_err, prec, message)
# E AssertionError: tensor(0.0000, device='cuda:0', dtype=torch.float32) not less than or equal to 2e-05 :
# 1 failed, 1202 passed, 19 skipped, 2 xfailed, 796 warnings in 1166.73 seconds =
# Traceback (most recent call last):
# File "test/run_test.py", line 391, in <module>
# main()
# File "test/run_test.py", line 383, in main
# raise RuntimeError(message)
tests_to_skip+=('TestNN and test_embedding_bag_cuda')
fi
##############################################################################
# MacOS specific flaky tests
##############################################################################
if [[ "$(uname)" == 'Darwin' ]]; then
# TestCppExtensions by default uses a temp folder in /tmp. This doesn't
# work for this Mac machine cause there is only one machine and /tmp is
# shared. (All the linux builds are on docker so have their own /tmp).
tests_to_skip+=('TestCppExtension')
fi
# Turn the set of tests to skip into an invocation that pytest understands
excluded_tests_logic=''
for exclusion in "${tests_to_skip[@]}"; do
if [[ -z "$excluded_tests_logic" ]]; then
# Only true for i==0
excluded_tests_logic="not ($exclusion)"
else
excluded_tests_logic="$excluded_tests_logic and not ($exclusion)"
fi
done
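# For example (illustrative), with tests_to_skip=('TestTorch and test_sum_dim'
# 'TestJit and test_trace_warn') the resulting pytest -k expression is:
#   not (TestTorch and test_sum_dim) and not (TestJit and test_trace_warn)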
##############################################################################
# Run the tests
##############################################################################
echo
echo "$(date) :: Calling 'python test/run_test.py -v -p pytest ${entire_file_exclusions[@]} -- --disable-pytest-warnings -k '$excluded_tests_logic'"
python test/run_test.py -v -p pytest ${entire_file_exclusions[@]} -- --disable-pytest-warnings -k "'" "$excluded_tests_logic" "'"
echo
echo "$(date) :: Finished 'python test/run_test.py -v -p pytest ${entire_file_exclusions[@]} -- --disable-pytest-warnings -k '$excluded_tests_logic'"
# cpp_extensions don't work with pytest, so we run them without pytest here,
# except there's a failure on CUDA builds (documented above), and
# cpp_extensions doesn't work on a shared mac machine (also documented above)
if [[ "$cuda_ver" == 'cpu' && "$(uname)" != 'Darwin' ]]; then
echo
echo "$(date) :: Calling 'python test/run_test.py -v -i cpp_extensions'"
python test/run_test.py -v -i cpp_extensions
echo
echo "$(date) :: Finished 'python test/run_test.py -v -i cpp_extensions'"
fi
# thd_distributed can run on Mac but not in pytest
if [[ "$(uname)" == 'Darwin' ]]; then
echo
echo "$(date) :: Calling 'python test/run_test.py -v -i thd_distributed'"
python test/run_test.py -v -i thd_distributed
echo
echo "$(date) :: Finished 'python test/run_test.py -v -i thd_distributed'"
fi

View File

@@ -0,0 +1,130 @@
#!/usr/bin/env python3
import concurrent.futures
import distutils.sysconfig
import functools
import itertools
import os
import re
from pathlib import Path
from typing import Any, List, Tuple
# We also check that there are [not] cxx11 symbols in libtorch
#
# To check whether it is using cxx11 ABI, check non-existence of symbol:
PRE_CXX11_SYMBOLS = (
"std::basic_string<",
"std::list",
)
# To check whether it is using pre-cxx11 ABI, check non-existence of symbol:
CXX11_SYMBOLS = (
"std::__cxx11::basic_string",
"std::__cxx11::list",
)
# NOTE: Checking the above symbols in all namespaces doesn't work, because
# devtoolset7 always produces some cxx11 symbols even if we build with old ABI,
# and CuDNN always has pre-cxx11 symbols even if we build with new ABI using gcc 5.4.
# Instead, we *only* check the above symbols in the following namespaces:
LIBTORCH_NAMESPACE_LIST = (
"c10::",
"at::",
"caffe2::",
"torch::",
)
def _apply_libtorch_symbols(symbols):
return [
re.compile(f"{x}.*{y}")
for (x, y) in itertools.product(LIBTORCH_NAMESPACE_LIST, symbols)
]
LIBTORCH_CXX11_PATTERNS = _apply_libtorch_symbols(CXX11_SYMBOLS)
LIBTORCH_PRE_CXX11_PATTERNS = _apply_libtorch_symbols(PRE_CXX11_SYMBOLS)
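# For example, the cartesian product above yields patterns such as
# re.compile("c10::.*std::__cxx11::basic_string"), which match demangled symbol
# names like "c10::Error::Error(std::__cxx11::basic_string<char, ...>)".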
@functools.lru_cache(100)
def get_symbols(lib: str) -> List[Tuple[str, str, str]]:
from subprocess import check_output
lines = check_output(f'nm "{lib}"|c++filt', shell=True)
return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]
def grep_symbols(lib: str, patterns: List[Any]) -> List[str]:
def _grep_symbols(
symbols: List[Tuple[str, str, str]], patterns: List[Any]
) -> List[str]:
rc = []
for _s_addr, _s_type, s_name in symbols:
for pattern in patterns:
if pattern.match(s_name):
rc.append(s_name)
continue
return rc
all_symbols = get_symbols(lib)
num_workers = 32
chunk_size = (len(all_symbols) + num_workers - 1) // num_workers
def _get_symbols_chunk(i):
return all_symbols[i * chunk_size : (i + 1) * chunk_size]
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
tasks = [
executor.submit(_grep_symbols, _get_symbols_chunk(i), patterns)
for i in range(num_workers)
]
return functools.reduce(list.__add__, (x.result() for x in tasks), [])
def check_lib_symbols_for_abi_correctness(lib: str, pre_cxx11_abi: bool = True) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
pre_cxx11_symbols = grep_symbols(lib, LIBTORCH_PRE_CXX11_PATTERNS)
num_cxx11_symbols = len(cxx11_symbols)
num_pre_cxx11_symbols = len(pre_cxx11_symbols)
print(f"num_cxx11_symbols: {num_cxx11_symbols}")
print(f"num_pre_cxx11_symbols: {num_pre_cxx11_symbols}")
if pre_cxx11_abi:
if num_cxx11_symbols > 0:
raise RuntimeError(
f"Found cxx11 symbols, but there shouldn't be any, see: {cxx11_symbols[:100]}"
)
if num_pre_cxx11_symbols < 1000:
raise RuntimeError("Didn't find enough pre-cxx11 symbols.")
# Check for no recursive iterators, regression test for https://github.com/pytorch/pytorch/issues/133437
rec_iter_symbols = grep_symbols(
lib, [re.compile("std::filesystem::recursive_directory_iterator.*")]
)
if len(rec_iter_symbols) > 0:
raise RuntimeError(
f"recursive_directory_iterator in used pre-CXX11 binaries, see; {rec_iter_symbols}"
)
else:
if num_pre_cxx11_symbols > 0:
raise RuntimeError(
f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"
)
if num_cxx11_symbols < 100:
raise RuntimeError("Didn't find enought cxx11 symbols")
def main() -> None:
if "install_root" in os.environ:
install_root = Path(os.getenv("install_root")) # noqa: SIM112
else:
if os.getenv("PACKAGE_TYPE") == "libtorch":
install_root = Path(os.getcwd())
else:
install_root = Path(distutils.sysconfig.get_python_lib()) / "torch"
libtorch_cpu_path = install_root / "lib" / "libtorch_cpu.so"
pre_cxx11_abi = "cxx11-abi" not in os.getenv("DESIRED_DEVTOOLSET", "")
check_lib_symbols_for_abi_correctness(libtorch_cpu_path, pre_cxx11_abi)
if __name__ == "__main__":
main()
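# Illustrative invocation (assuming the env vars normally set by the calling CI
# script):
#   DESIRED_DEVTOOLSET=cxx11-abi install_root=/path/to/libtorch \
#       python3 check_binary_symbols.py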

View File

@@ -0,0 +1,205 @@
import argparse
from torchvision import datasets, transforms
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__() # noqa: UP008
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.dropout1 = nn.Dropout(0.25)
self.dropout2 = nn.Dropout(0.5)
self.fc1 = nn.Linear(9216, 128)
self.fc2 = nn.Linear(128, 10)
def forward(self, x):
x = self.conv1(x)
x = F.relu(x)
x = self.conv2(x)
x = F.relu(x)
x = F.max_pool2d(x, 2)
x = self.dropout1(x)
x = torch.flatten(x, 1)
x = self.fc1(x)
x = F.relu(x)
x = self.dropout2(x)
x = self.fc2(x)
output = F.log_softmax(x, dim=1)
return output
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
if batch_idx % args.log_interval == 0:
print(
f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}" # noqa: B950
)
if args.dry_run:
break
def test(model, device, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(
output, target, reduction="sum"
).item() # sum up batch loss
pred = output.argmax(
dim=1, keepdim=True
) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
print(
f"\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\n" # noqa: B950
)
def timed(fn):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
result = fn()
end.record()
torch.cuda.synchronize()
return result, start.elapsed_time(end) / 1000
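# Note: Event.elapsed_time() returns milliseconds, so timed() reports seconds.
# Illustrative use: _, seconds = timed(lambda: test(model, device, test_loader))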
def main():
# Training settings
parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
parser.add_argument(
"--batch-size",
type=int,
default=64,
metavar="N",
help="input batch size for training (default: 64)",
)
parser.add_argument(
"--test-batch-size",
type=int,
default=1000,
metavar="N",
help="input batch size for testing (default: 1000)",
)
parser.add_argument(
"--epochs",
type=int,
default=4,
metavar="N",
help="number of epochs to train (default: 14)",
)
parser.add_argument(
"--lr",
type=float,
default=1.0,
metavar="LR",
help="learning rate (default: 1.0)",
)
parser.add_argument(
"--gamma",
type=float,
default=0.7,
metavar="M",
help="Learning rate step gamma (default: 0.7)",
)
parser.add_argument(
"--no-cuda", action="store_true", default=False, help="disables CUDA training"
)
parser.add_argument(
"--no-mps",
action="store_true",
default=False,
help="disables macOS GPU training",
)
parser.add_argument(
"--dry-run",
action="store_true",
default=False,
help="quickly check a single pass",
)
parser.add_argument(
"--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
)
parser.add_argument(
"--log-interval",
type=int,
default=100,
metavar="N",
help="how many batches to wait before logging training status",
)
parser.add_argument(
"--save-model",
action="store_true",
default=False,
help="For Saving the current Model",
)
args = parser.parse_args()
use_cuda = not args.no_cuda and torch.cuda.is_available()
use_mps = not args.no_mps and torch.backends.mps.is_available()
torch.manual_seed(args.seed)
torch.backends.cuda.matmul.allow_tf32 = True
if use_cuda:
device = torch.device("cuda")
elif use_mps:
device = torch.device("mps")
else:
device = torch.device("cpu")
train_kwargs = {"batch_size": args.batch_size}
test_kwargs = {"batch_size": args.test_batch_size}
if use_cuda:
cuda_kwargs = {"num_workers": 1, "pin_memory": True, "shuffle": True}
train_kwargs.update(cuda_kwargs)
test_kwargs.update(cuda_kwargs)
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
dataset1 = datasets.MNIST("../data", train=True, download=True, transform=transform)
dataset2 = datasets.MNIST("../data", train=False, transform=transform)
train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)
model = Net().to(device)
opt_model = torch.compile(model, mode="max-autotune")
optimizer = optim.Adadelta(opt_model.parameters(), lr=args.lr)
scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
for epoch in range(1, args.epochs + 1):
print(
f"Training Time: {timed(lambda: train(args, opt_model, device, train_loader, optimizer, epoch))[1]}"
)
print(
f"Evaluation Time: {timed(lambda: test(opt_model, device, test_loader))[1]}"
)
scheduler.step()
if args.save_model:
torch.save(opt_model.state_dict(), "mnist_cnn.pt")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,394 @@
import argparse
import importlib
import json
import os
import re
import subprocess
import sys
from pathlib import Path
import torch
import torch._dynamo
import torch.nn as nn
import torch.nn.functional as F
if "MATRIX_GPU_ARCH_VERSION" in os.environ:
gpu_arch_ver = os.getenv("MATRIX_GPU_ARCH_VERSION")
else:
gpu_arch_ver = os.getenv("GPU_ARCH_VERSION") # Use fallback if available
gpu_arch_type = os.getenv("MATRIX_GPU_ARCH_TYPE")
channel = os.getenv("MATRIX_CHANNEL")
package_type = os.getenv("MATRIX_PACKAGE_TYPE")
target_os = os.getenv("TARGET_OS", sys.platform)
BASE_DIR = Path(__file__).parent.parent.parent
is_cuda_system = gpu_arch_type == "cuda"
NIGHTLY_ALLOWED_DELTA = 3
MODULES = [
{
"name": "torchvision",
"repo": "https://github.com/pytorch/vision.git",
"smoke_test": "./vision/test/smoke_test.py",
"extension": "extension",
"repo_name": "vision",
},
{
"name": "torchaudio",
"repo": "https://github.com/pytorch/audio.git",
"smoke_test": "./audio/test/smoke_test/smoke_test.py --no-ffmpeg",
"extension": "_extension",
"repo_name": "audio",
},
]
class Net(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.fc1 = nn.Linear(9216, 1)
def forward(self, x):
x = self.conv1(x)
x = self.conv2(x)
x = F.max_pool2d(x, 2)
x = torch.flatten(x, 1)
output = self.fc1(x)
return output
def load_json_from_basedir(filename: str):
try:
with open(BASE_DIR / filename) as fptr:
return json.load(fptr)
except FileNotFoundError as exc:
raise ImportError(f"File {filename} not found error: {exc.strerror}") from exc
except json.JSONDecodeError as exc:
raise ImportError(f"Invalid JSON {filename}") from exc
def read_release_matrix():
return load_json_from_basedir("release_matrix.json")
def test_numpy():
import numpy as np
x = np.arange(5)
torch.tensor(x)
def check_version(package: str) -> None:
    release_version = os.getenv("RELEASE_VERSION")
    # if release_version is specified, use it to validate the packages
    if release_version:
        release_matrix = read_release_matrix()
        stable_version = release_matrix["torch"]
    else:
        stable_version = os.getenv("MATRIX_STABLE_VERSION")

    # only makes sense to check nightly package where dates are known
    if channel == "nightly":
        check_nightly_binaries_date(package)
    elif stable_version is not None:
        if not torch.__version__.startswith(stable_version):
            raise RuntimeError(
                f"Torch version mismatch, expected {stable_version} for channel {channel}. But it's {torch.__version__}"
            )

        if release_version and package == "all":
            for module in MODULES:
                imported_module = importlib.import_module(module["name"])
                module_version = imported_module.__version__
                if not module_version.startswith(release_matrix[module["name"]]):
                    raise RuntimeError(
                        f"{module['name']} version mismatch, expected: \
                            {release_matrix[module['name']]} for channel {channel}. But it's {module_version}"
                    )
                else:
                    print(
                        f"{module['name']} version actual: {module_version} expected: \
                            {release_matrix[module['name']]} for channel {channel}."
                    )
    else:
        print(f"Skip version check for channel {channel} as stable version is None")
def check_nightly_binaries_date(package: str) -> None:
    from datetime import datetime

    format_dt = "%Y%m%d"

    date_t_str = re.findall("dev\\d+", torch.__version__)
    date_t_delta = datetime.now() - datetime.strptime(date_t_str[0][3:], format_dt)
    if date_t_delta.days >= NIGHTLY_ALLOWED_DELTA:
        raise RuntimeError(
            f"the binaries are from {date_t_str} and are more than {NIGHTLY_ALLOWED_DELTA} days old!"
        )

    if package == "all":
        for module in MODULES:
            imported_module = importlib.import_module(module["name"])
            module_version = imported_module.__version__
            date_m_str = re.findall("dev\\d+", module_version)
            date_m_delta = datetime.now() - datetime.strptime(
                date_m_str[0][3:], format_dt
            )
            print(f"Nightly date check for {module['name']} version {module_version}")
            if date_m_delta.days > NIGHTLY_ALLOWED_DELTA:
                raise RuntimeError(
                    f"Expected {module['name']} to be less than {NIGHTLY_ALLOWED_DELTA} days old. But it is {date_m_delta} old."
                )
def test_cuda_runtime_errors_captured() -> None:
    cuda_exception_missed = True
    try:
        print("Testing test_cuda_runtime_errors_captured")
        torch._assert_async(torch.tensor(0, device="cuda"))
        torch._assert_async(torch.tensor(0 + 0j, device="cuda"))
    except RuntimeError as e:
        if re.search("CUDA", f"{e}"):
            print(f"Caught CUDA exception with success: {e}")
            cuda_exception_missed = False
        else:
            raise e
    if cuda_exception_missed:
        raise RuntimeError("Expected a CUDA RuntimeError, but none was raised!")
def smoke_test_cuda(
    package: str, runtime_error_check: str, torch_compile_check: str
) -> None:
    if not torch.cuda.is_available() and is_cuda_system:
        raise RuntimeError(f"Expected CUDA {gpu_arch_ver}. However CUDA is not loaded.")

    if package == "all" and is_cuda_system:
        for module in MODULES:
            imported_module = importlib.import_module(module["name"])
            # TBD for vision move extension module to private so it will
            # be _extension.
            version = "N/A"
            if module["extension"] == "extension":
                version = imported_module.extension._check_cuda_version()
            else:
                version = imported_module._extension._check_cuda_version()
            print(f"{module['name']} CUDA: {version}")

    # torch.compile is available on macos-arm64 and Linux for python 3.8-3.13
    if (
        torch_compile_check == "enabled"
        and sys.version_info < (3, 14, 0)
        and target_os in ["linux", "linux-aarch64", "macos-arm64", "darwin"]
    ):
        smoke_test_compile("cuda" if torch.cuda.is_available() else "cpu")

    if torch.cuda.is_available():
        if torch.version.cuda != gpu_arch_ver:
            raise RuntimeError(
                f"Wrong CUDA version. Loaded: {torch.version.cuda} Expected: {gpu_arch_ver}"
            )
        print(f"torch cuda: {torch.version.cuda}")
        # TODO: add cudnn version validation
        print(f"torch cudnn: {torch.backends.cudnn.version()}")
        print(f"cuDNN enabled? {torch.backends.cudnn.enabled}")

        torch.cuda.init()
        print("CUDA initialized successfully")
        print(f"Number of CUDA devices: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"Device {i}: {torch.cuda.get_device_name(i)}")

        # nccl is available only on Linux
        if sys.platform in ["linux", "linux2"]:
            print(f"torch nccl version: {torch.cuda.nccl.version()}")

        if runtime_error_check == "enabled":
            test_cuda_runtime_errors_captured()
def smoke_test_conv2d() -> None:
    import torch.nn as nn

    print("Testing smoke_test_conv2d")

    # With square kernels and equal stride
    m = nn.Conv2d(16, 33, 3, stride=2)
    # non-square kernels and unequal stride and with padding
    m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
    assert m is not None
    # non-square kernels and unequal stride and with padding and dilation
    basic_conv = nn.Conv2d(
        16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1)
    )
    input = torch.randn(20, 16, 50, 100)
    output = basic_conv(input)

    if is_cuda_system:
        print("Testing smoke_test_conv2d with cuda")
        conv = nn.Conv2d(3, 3, 3).cuda()
        x = torch.randn(1, 3, 24, 24, device="cuda")
        with torch.cuda.amp.autocast():
            out = conv(x)
        assert out is not None

        supported_dtypes = [torch.float16, torch.float32, torch.float64]
        for dtype in supported_dtypes:
            print(f"Testing smoke_test_conv2d with cuda for {dtype}")
            conv = basic_conv.to(dtype).cuda()
            input = torch.randn(20, 16, 50, 100, device="cuda").type(dtype)
            output = conv(input)
            assert output is not None
def test_linalg(device="cpu") -> None:
    print(f"Testing smoke_test_linalg on {device}")
    A = torch.randn(5, 3, device=device)
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    assert (
        U.shape == A.shape
        and S.shape == torch.Size([3])
        and Vh.shape == torch.Size([3, 3])
    )
    torch.dist(A, U @ torch.diag(S) @ Vh)

    U, S, Vh = torch.linalg.svd(A)
    assert (
        U.shape == torch.Size([5, 5])
        and S.shape == torch.Size([3])
        and Vh.shape == torch.Size([3, 3])
    )
    torch.dist(A, U[:, :3] @ torch.diag(S) @ Vh)

    A = torch.randn(7, 5, 3, device=device)
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    torch.dist(A, U @ torch.diag_embed(S) @ Vh)

    if device == "cuda":
        supported_dtypes = [torch.float32, torch.float64]
        for dtype in supported_dtypes:
            print(f"Testing smoke_test_linalg with cuda for {dtype}")
            A = torch.randn(20, 16, 50, 100, device=device, dtype=dtype)
            torch.linalg.svd(A)
def smoke_test_compile(device: str = "cpu") -> None:
    supported_dtypes = [torch.float16, torch.float32, torch.float64]

    def foo(x: torch.Tensor) -> torch.Tensor:
        return torch.sin(x) + torch.cos(x)

    for dtype in supported_dtypes:
        print(f"Testing smoke_test_compile for {device} and {dtype}")
        x = torch.rand(3, 3, device=device).type(dtype)
        x_eager = foo(x)
        x_pt2 = torch.compile(foo)(x)
        torch.testing.assert_close(x_eager, x_pt2)

    # Check that SIMD instructions were detected for the architecture
    if device == "cpu":
        from torch._inductor.codecache import pick_vec_isa

        isa = pick_vec_isa()
        if not isa:
            raise RuntimeError("Can't detect vectorized ISA for CPU")
        print(f"Picked CPU ISA {type(isa).__name__} bit width {isa.bit_width()}")

    # Reset torch dynamo since we are changing mode
    torch._dynamo.reset()
    dtype = torch.float32
    torch.set_float32_matmul_precision("high")
    print(f"Testing smoke_test_compile with mode 'max-autotune' for {dtype}")
    x = torch.rand(64, 1, 28, 28, device=device).type(torch.float32)
    model = Net().to(device=device)
    x_pt2 = torch.compile(model, mode="max-autotune")(x)
def smoke_test_modules():
    cwd = os.getcwd()
    for module in MODULES:
        if module["repo"]:
            if not os.path.exists(f"{cwd}/{module['repo_name']}"):
                print(f"Path does not exist: {cwd}/{module['repo_name']}")
                try:
                    subprocess.check_output(
                        f"git clone --depth 1 {module['repo']}",
                        stderr=subprocess.STDOUT,
                        shell=True,
                    )
                except subprocess.CalledProcessError as exc:
                    raise RuntimeError(
                        f"Cloning {module['repo']} FAIL: {exc.returncode} Output: {exc.output}"
                    ) from exc
            try:
                smoke_test_command = f"python3 {module['smoke_test']}"
                if target_os == "windows":
                    smoke_test_command = f"python {module['smoke_test']}"
                output = subprocess.check_output(
                    smoke_test_command,
                    stderr=subprocess.STDOUT,
                    shell=True,
                    universal_newlines=True,
                )
            except subprocess.CalledProcessError as exc:
                raise RuntimeError(
                    f"Module {module['name']} FAIL: {exc.returncode} Output: {exc.output}"
                ) from exc
            else:
                print(f"Output: \n{output}\n")
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--package",
        help="Package to include in smoke testing",
        type=str,
        choices=["all", "torchonly"],
        default="all",
    )
    parser.add_argument(
        "--runtime-error-check",
        help="No Runtime Error check",
        type=str,
        choices=["enabled", "disabled"],
        default="enabled",
    )
    parser.add_argument(
        "--torch-compile-check",
        help="Check torch compile",
        type=str,
        choices=["enabled", "disabled"],
        default="enabled",
    )
    return parser.parse_args()
def main() -> None:
    options = parse_args()
    print(f"torch: {torch.__version__}")
    print(torch.__config__.parallel_info())

    # All PyTorch binary builds should be built with OpenMP
    if not torch.backends.openmp.is_available():
        raise RuntimeError("PyTorch must be built with OpenMP support")

    check_version(options.package)
    smoke_test_conv2d()
    test_linalg()
    test_numpy()

    if is_cuda_system:
        test_linalg("cuda")

    if options.package == "all":
        smoke_test_modules()

    smoke_test_cuda(
        options.package, options.runtime_error_check, options.torch_compile_check
    )


if __name__ == "__main__":
    main()
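
Taken together, the script above is a post-install smoke test: it validates the installed version (stable or nightly), runs small conv2d/linalg/numpy checks, verifies the CUDA runtime, and optionally compiles a tiny model. A minimal sketch of driving it locally follows; the script path and the MATRIX_* values are placeholders for illustration, not what CI actually exports.

import os
import subprocess
import sys

# The environment variable names are the ones the script reads; the values
# here are assumptions for a local dry run.
env = dict(
    os.environ,
    MATRIX_GPU_ARCH_TYPE="cuda",      # assumption: checking a CUDA build
    MATRIX_GPU_ARCH_VERSION="12.4",   # placeholder version string
    MATRIX_CHANNEL="nightly",
    MATRIX_PACKAGE_TYPE="manywheel",
)
# "smoke_test.py" is a placeholder path for the script shown above.
subprocess.check_call(
    [sys.executable, "smoke_test.py", "--package", "torchonly",
     "--runtime-error-check", "disabled"],
    env=env,
)

Passing --package torchonly keeps the run quick: the torchvision/torchaudio clone-and-test loop and the per-module version checks only run when --package all is selected.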


@ -4,7 +4,7 @@
# (This is set by default in the Docker images we build, so you don't
# need to set it yourself.)
set -ex
set -ex -o pipefail
# Suppress ANSI color escape sequences
export TERM=vt100
@ -14,7 +14,7 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -48,17 +48,17 @@ NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"
export VALGRIND=ON
# export TORCH_INDUCTOR_INSTALL_GXX=ON
if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
# clang9 appears to miscompile code involving c10::optional<c10::SymInt>,
if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# clang9 appears to miscompile code involving std::optional<c10::SymInt>,
# such that valgrind complains along these lines:
#
# Conditional jump or move depends on uninitialised value(s)
# at 0x40303A: ~optional_base (Optional.h:281)
# by 0x40303A: call (Dispatcher.h:448)
# by 0x40303A: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:10)
# by 0x40303A: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::SymInt>) (basic.cpp:10)
# by 0x403700: main (basic.cpp:16)
# Uninitialised value was created by a stack allocation
# at 0x402AAA: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:6)
# at 0x402AAA: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::SymInt>) (basic.cpp:6)
#
# The problem does not appear with gcc or newer versions of clang (we tested
# clang14). So we suppress valgrind testing for clang9 specifically.
@ -72,7 +72,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
#
# using namespace at;
#
# Tensor call(const at::Tensor & self, c10::SymIntArrayRef size, c10::SymIntArrayRef stride, c10::optional<c10::SymInt> storage_offset) {
# Tensor call(const at::Tensor & self, c10::SymIntArrayRef size, c10::SymIntArrayRef stride, std::optional<c10::SymInt> storage_offset) {
# auto op = c10::Dispatcher::singleton()
# .findSchemaOrThrow(at::_ops::as_strided::name, at::_ops::as_strided::overload_name)
# .typed<at::_ops::as_strided::schema>();
@ -81,7 +81,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
#
# int main(int argv) {
# Tensor b = empty({3, 4});
# auto z = call(b, b.sym_sizes(), b.sym_strides(), c10::nullopt);
# auto z = call(b, b.sym_sizes(), b.sym_strides(), std::nullopt);
# }
export VALGRIND=OFF
fi
@ -129,7 +129,7 @@ if [[ "$TEST_CONFIG" == 'default' ]]; then
fi
if [[ "$TEST_CONFIG" == 'distributed' ]] && [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export HIP_VISIBLE_DEVICES=0,1
export HIP_VISIBLE_DEVICES=0,1,2,3
fi
if [[ "$TEST_CONFIG" == 'slow' ]]; then
@ -153,6 +153,8 @@ elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"
# setting PYTHON_TEST_EXTRA_OPTION
export PYTHON_TEST_EXTRA_OPTION="--xpu"
# Disable sccache for xpu test due to flaky issue https://github.com/pytorch/pytorch/issues/143585
sudo rm -rf /opt/cache
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then
@ -169,9 +171,13 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# Source Intel oneAPI environment script to enable xpu runtime related libraries
# refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
# refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
if [ -f /opt/intel/oneapi/umf/latest/env/vars.sh ]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/umf/latest/env/vars.sh
fi
# Check XPU status before testing
xpu-smi discovery
fi
@ -196,6 +202,9 @@ install_tlparse
# ASAN test is not working
if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export ASAN_OPTIONS=detect_leaks=0:symbolize=1:detect_stack_use_after_return=true:strict_init_order=true:detect_odr_violation=1:detect_container_overflow=0:check_initialization_order=true:debug=true
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export ASAN_OPTIONS="${ASAN_OPTIONS}:protect_shadow_gap=0"
fi
export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
@ -233,8 +242,8 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
# it depends on a ton of dynamic libraries that most programs aren't gonna
# have, and it applies to child processes.
# TODO: get rid of the hardcoded path
export LD_PRELOAD=/usr/lib/llvm-15/lib/clang/15.0.7/lib/linux/libclang_rt.asan-x86_64.so
LD_PRELOAD=$(clang --print-file-name=libclang_rt.asan-x86_64.so)
export LD_PRELOAD
# Disable valgrind for asan
export VALGRIND=OFF
@ -281,7 +290,7 @@ test_python_shard() {
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
@ -293,7 +302,7 @@ test_python() {
}
test_dynamo_shard() {
test_dynamo_wrapped_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
exit 1
@ -306,8 +315,10 @@ test_dynamo_shard() {
--exclude-jit-executor \
--exclude-distributed-tests \
--exclude-torch-export-tests \
--exclude-aot-dispatch-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
--verbose \
--upload-artifacts-while-running
assert_git_not_dirty
}
@ -318,8 +329,9 @@ test_inductor_distributed() {
python test/run_test.py -i inductor/test_aot_inductor.py -k test_non_default_cuda_device --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose
python test/run_test.py -i distributed/_composable/test_replicate_with_compiler.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
@ -331,11 +343,12 @@ test_inductor_distributed() {
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_compile.py --verbose
python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose
# This runs on both single-GPU and multi-GPU instances. It should be smart about skipping tests that aren't supported
# if the required number of GPUs isn't available.
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives --verbose
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives distributed/test_compute_comm_reordering --verbose
assert_git_not_dirty
}
@ -369,22 +382,53 @@ test_inductor_aoti() {
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
}
test_inductor_cpp_wrapper_abi_compatible() {
export TORCHINDUCTOR_ABI_COMPATIBLE=1
test_inductor_cpp_wrapper_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
exit 1
fi
export TORCHINDUCTOR_CPP_WRAPPER=1
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
if [[ "$1" -eq "2" ]]; then
# For now, manually put the opinfo tests in shard 2, and all other tests in
# shard 1. Test specific things triggering past bugs, for now.
python test/run_test.py \
--include inductor/test_torchinductor_opinfo \
-k 'linalg or to_sparse' \
--verbose
exit
fi
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
# Run certain inductor unit tests with cpp wrapper. In the end state, we
# should be able to run all the inductor unit tests with cpp_wrapper.
python test/run_test.py --include inductor/test_torchinductor --verbose
# Run inductor benchmark tests with cpp wrapper.
# Skip benchmark tests if it's in rerun-disabled-mode.
if [[ "${PYTORCH_TEST_RERUN_DISABLED_TESTS}" == "1" ]]; then
echo "skip dynamo benchmark tests for rerun-disabled-test"
else
echo "run dynamo benchmark tests with cpp wrapper"
python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
fi
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -401,10 +445,10 @@ pr_time_benchmarks() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt"
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks python benchmarks/dynamo/pr_time_benchmarks/check_results.py "benchmarks/dynamo/pr_time_benchmarks/expected_results.csv" "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "$TEST_REPORTS_DIR/new_expected_results.csv"
}
if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then
@ -512,7 +556,7 @@ test_perf_for_dashboard() {
"${target_flag[@]}" --"$mode" --"$dtype" --export --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_export_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
@ -567,13 +611,6 @@ test_single_dynamo_benchmark() {
test_perf_for_dashboard "$suite" \
"${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"
else
if [[ "${TEST_CONFIG}" == *aot_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then
# Test AOTInductor with the ABI-compatible mode on CI
# This can be removed once the ABI-compatible mode becomes default.
# For CPU device, we perfer non ABI-compatible mode on CI when testing AOTInductor.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
if [[ "${TEST_CONFIG}" == *_avx2* ]]; then
TEST_CONFIG=${TEST_CONFIG//_avx2/}
fi
@ -595,7 +632,15 @@ test_single_dynamo_benchmark() {
}
test_inductor_micro_benchmark() {
# torchao requires cuda 8.0 or above for bfloat16 support
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;8.6"
fi
install_torchao
TEST_REPORTS_DIR=$(pwd)/test/test-reports
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
test_inductor_set_cpu_affinity
fi
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}
@ -604,6 +649,11 @@ test_inductor_halide() {
assert_git_not_dirty
}
test_inductor_triton_cpu() {
python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose
assert_git_not_dirty
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -641,32 +691,12 @@ test_inductor_torchbench_smoketest_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
# Test some models in the cpp wrapper mode
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \
@ -710,6 +740,10 @@ test_inductor_set_cpu_affinity(){
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
# Set number of cores to 16 on Aarch64 for performance runs.
if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then
cores=16
fi
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
@ -746,19 +780,9 @@ test_inductor_torchbench_cpu_smoketest_perf(){
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
# Allow 1% variance for CPU perf to accommodate perf fluctuation
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target" -s 0.99
done
# Add a few ABI-compatible accuracy tests for CPU. These can be removed once we turn on ABI-compatible as default.
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only adv_inception_v3 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only beit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/aot_inductor_timm_inference.csv"
}
test_torchbench_gcp_smoketest(){
@ -816,7 +840,7 @@ test_without_numpy() {
# Regression test for https://github.com/pytorch/pytorch/issues/66353
python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;print(torch.tensor([torch.tensor(0.), torch.tensor(1.)]))"
# Regression test for https://github.com/pytorch/pytorch/issues/109387
if [[ "${TEST_CONFIG}" == *dynamo* ]]; then
if [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;torch.compile(lambda x:print(x))('Hello World')"
fi
popd
@ -950,6 +974,9 @@ test_distributed() {
python test/run_test.py --cpp --verbose -i cpp/HashStoreTest
python test/run_test.py --cpp --verbose -i cpp/TCPStoreTest
echo "Testing multi-GPU linalg tests"
python test/run_test.py -i test_linalg.py -k test_matmul_offline_mgpu_tunable --verbose
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
MPIEXEC=$(command -v mpiexec)
if [[ -n "$MPIEXEC" ]]; then
@ -1199,7 +1226,7 @@ EOF
git reset --hard "${SHA_TO_COMPARE}"
git submodule sync && git submodule update --init --recursive
echo "::group::Installing Torch From Base Commit"
pip install -r requirements.txt
pip3 install -r requirements.txt
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
python setup.py bdist_wheel --bdist-dir="base_bdist_tmp" --dist-dir="base_dist"
@ -1233,7 +1260,7 @@ EOF
}
test_bazel() {
set -e
set -e -o pipefail
# bazel test needs sccache setup.
# shellcheck source=./common-build.sh
@ -1356,10 +1383,11 @@ test_executorch() {
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# For llama3
bash examples/models/llama3_2_vision/install_requirements.sh
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
# shellcheck disable=SC1091
source .ci/scripts/setup-linux.sh cmake
bash .ci/scripts/setup-linux.sh cmake
echo "Run ExecuTorch unit tests"
pytest -v -n auto
@ -1369,7 +1397,7 @@ test_executorch() {
echo "Run ExecuTorch regression tests for some models"
# TODO(huydhn): Add more coverage here using ExecuTorch's gather models script
# shellcheck disable=SC1091
source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''
source .ci/scripts/test_model.sh mv3 cmake xnnpack-quantization-delegation ''
popd
@ -1380,14 +1408,17 @@ test_executorch() {
assert_git_not_dirty
}
test_linux_aarch64(){
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop --verbose
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \
dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Inductor tests
python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \
@ -1397,14 +1428,20 @@ test_linux_aarch64(){
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \
inductor/test_triton_cpu_backend inductor/test_triton_extension_backend inductor/test_mkldnn_pattern_matcher inductor/test_cpu_cpp_wrapper \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
if [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
if [[ "${TEST_CONFIG}" == *numpy_2* ]]; then
# Install numpy-2.0.2 and compatible scipy & numba versions
python -mpip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
python test/run_test.py --include dynamo/test_functions.py dynamo/test_unspec.py test_binary_ufuncs.py test_fake_tensor.py test_linalg.py test_numpy_interop.py test_tensor_creation_ops.py test_torch.py torch_np/test_basic.py
elif [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
test_linux_aarch64
elif [[ "${TEST_CONFIG}" == *backward* ]]; then
test_forward_backward_compatibility
@ -1430,6 +1467,8 @@ elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-triton-cpu* ]]; then
test_inductor_triton_cpu
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
@ -1446,14 +1485,13 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
else
install_torchaudio cuda
fi
install_torchtext
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
@ -1472,9 +1510,11 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchaudio cuda
install_torchvision
test_inductor_cpp_wrapper_abi_compatible
checkout_install_torchbench hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
@ -1483,9 +1523,9 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then
elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
install_torchvision
test_dynamo_shard "${SHARD_NUMBER}"
test_dynamo_wrapped_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_aten
fi


@ -0,0 +1,26 @@
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(simple-torch-test)
find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
add_executable(simple-torch-test simple-torch-test.cpp)
target_include_directories(simple-torch-test PRIVATE ${TORCH_INCLUDE_DIRS})
target_link_libraries(simple-torch-test "${TORCH_LIBRARIES}")
set_property(TARGET simple-torch-test PROPERTY CXX_STANDARD 17)
find_package(CUDAToolkit 11.8)
target_link_libraries(simple-torch-test CUDA::cudart CUDA::cufft CUDA::cusparse CUDA::cublas CUDA::cusolver)
find_library(CUDNN_LIBRARY NAMES cudnn)
target_link_libraries(simple-torch-test ${CUDNN_LIBRARY} )
if(MSVC)
  file(GLOB TORCH_DLLS "$ENV{CUDA_PATH}/bin/cudnn64_8.dll" "$ENV{NVTOOLSEXT_PATH}/bin/x64/*.dll")
  message("dlls to copy " ${TORCH_DLLS})
  add_custom_command(TARGET simple-torch-test
                     POST_BUILD
                     COMMAND ${CMAKE_COMMAND} -E copy_if_different
                     ${TORCH_DLLS}
                     $<TARGET_FILE_DIR:simple-torch-test>)
endif(MSVC)

Some files were not shown because too many files have changed in this diff.